Launch snapshot

author: Will DePue <williamd@openai.com> 2026-03-18 09:32:01 -0700
committer: Will DePue <williamd@openai.com> 2026-03-18 09:32:01 -0700
commit: a15093adad328a650d421e53c078cbd2c45beb0e (patch)
tree: e054c4bde12b89e6d3b39d611d9caadabc7f7234
21 files changed, 9200 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..3423c41
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,11 @@
+data/tokenizers
+__pycache__/
+.DS_Store
+modded-nanogpt/
+modded-nanogpt
+data/datasets
+data/manifest.json
+data/docs_selected.jsonl
+.mypy_cache/
+.venv
+logs/
+\ No newline at end of file
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..4243cb4
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 OpenAI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1c9739b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,186 @@
+<img width="3840" height="1280" alt="1920x640-discord" src="https://github.com/user-attachments/assets/90607b26-171f-476a-90ae-69b9dbb7cb30" />
+
+<br>
+<br>
+
+**OpenAI Model Craft Challenge: Parameter Golf** is a challenge to train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s, evaluated by compression on the FineWeb validation set (tokenizer-agnostic, bits per byte).
+
+This challenge is heavily inspired by the [NanoGPT Speedrunning](https://github.com/KellerJordan/modded-nanogpt) challenge, where participants compete to train a model that reaches 3.28 FineWeb validation loss as quickly as possible. We're excited to see how optimizing for a parameter-constrained setting pushes people toward unique architectures (test-time compute, aggressive parameter tying, depth recurrence, low-rank training, ...), compression schemes (low precision, QAT, bitnets, novel tokenizers, ...), and other creative submissions (test-time training, long context, megakernels ...). 
+
+If you're familiar with [neural scaling laws](https://arxiv.org/abs/2001.08361), you can consider this challenge a form of L(N) optimization, where the objective is to optimize the lowest loss given a fixed number of parameters (N) unconstrained by data, compute, steps, or architecture. Challenges like the [NanoGPT Speedrun](https://github.com/KellerJordan/modded-nanogpt), which optimizes for a form of L(T) (~lowest loss given constrained time) or the [NanoGPT Slowrun](https://github.com/qlabs-eng/slowrun), which optimizes for L(D) (lowest loss given constrained dataset size), can be thought of as equivalent challenges in this family.
+
+Ideally, we'd allow for submissions to use arbitrary computational resources. But in order to make the challenge not inaccessibly expensive, we're limiting *leaderboard submissions* to 10 minutes on 8xH100s. However, we'd still love to see submissions that don't meet the compute limitation requirements in our 'Non-record Submissions' section: We're excited to see people push the infinite frontier of parameter limited performance as well.
+
+We also know compute is expensive, so OpenAI is sponsoring $1,000,000 in compute credits to help people get started training their models. To request a compute grant, use this form: [Request a Compute Grant](https://openai.com/index/parameter-golf/#credit-form).
+
+## Participant Form
+
+If you enjoy solving very difficult technical problems, please introduce yourself via the [Challenge Participant Form](https://jobs.ashbyhq.com/openai/form/open-ai-challenge-parameter-golf). It helps us attribute challenge submissions and reach out about opportunities with OpenAI. _Completing the form is not required to participate._
+
+Many researchers at OpenAI first distinguished themselves through elite mathematics and programming competitions. The Model Craft Challenge is designed in that spirit: testing the ability to tackle unfamiliar problems with creativity and rigor, qualities we believe are essential for frontier AI research.
+
+In June, we plan to hire a small cohort of early-career researchers, targeting current undergraduate students and recent graduates, including Olympiad medalists and elite competitors. For exceptional participants, the challenge may also serve as a way to stand out to OpenAI researchers and recruiters.
+
+The challenge runs from March 18th to April 30th. 
+
+Happy training!
+
+## Leaderboard
+
+
+| Rank | Run | Score | Author | Summary | Date | Info |
+|-----:|-----|------:|--------|---------|------|------|
+| 1 | Naive Baseline | 1.2244 | Baseline | 9layer 512dim 1024vocab TiedEmbeddings 4 KV heads | 2026-03-18 | [info](records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md) |
+
+#### Notable Non-Record Runs
+
+| Run | Score | Author | Summary | Date | Info |
+|-----|------:|--------|---------|------|------|
+| 4-Hour Baseline | 1.2074 | Will DePue | Testing unlimited compute, 4 hours on 8xH100 | 2026-03-18 | [info](records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md) |
+
+## Getting Started
+
+### Training Your First Model (Mac with Apple Silicon)
+
+If you have an Apple laptop or desktop with Apple Silicon, we've set up a simple MLX training script to help you start iterating locally.
+
+If you don't have a Mac with Apple Silicon, you can run an adapted version of this script without MLX support. Just ask [Codex](https://openai.com/codex/) to refactor it; the change is straightforward. It may still be fairly slow, so we recommend jumping straight to cloud GPUs with Runpod.
+
+First, clone the repository, create a fresh Python environment, and install the packages needed for the MLX path plus dataset download:
+
+```bash
+git clone https://github.com/openai/parameter-golf.git
+cd parameter-golf
+python3 -m venv .venv
+source .venv/bin/activate
+python -m pip install --upgrade pip
+pip install mlx numpy sentencepiece huggingface-hub datasets tqdm
+```
+
+Download our cached version of FineWeb with the 1024-token vocabulary:
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
+```
+
+This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
+By default this downloads the full validation split plus 80 training shards (8B tokens). For a smaller local smoke subset, pass `--train-shards 1`, for example `python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1`.
+
+Then run a small MLX training job:
+
+```bash
+RUN_ID=mlx_smoke \
+ITERATIONS=200 \
+TRAIN_BATCH_TOKENS=8192 \
+VAL_LOSS_EVERY=0 \
+VAL_BATCH_SIZE=8192 \
+python3 train_gpt_mlx.py
+```
+
+Validation always runs on the full `fineweb_val_*` split, which is the fixed first-50k-document set. The smoke command above skips periodic validation and just prints the final `val_loss` and `val_bpb` once at the end.
+
+### Scaling Up to a Remote Machine
+
+Once you're happy with your local tests, or you want more compute, switch to a remote CUDA machine.
+
+You can rent GPUs from anywhere, but OpenAI is partnering with Runpod to make setup as easy as possible.  
+
+#### Launching a 1xH100 Pod
+
+1. First, [create a Runpod account](https://console.runpod.io/deploy). You should also set up an SSH key in the Settings tab on the left so you can connect to your remote machine. If you're new to this, ask Codex to help you set it up.
+
+2. Once you've set up your account, create a new GPU Cloud Pod. You can choose whichever GPU SKU you'd like. Final leaderboard submissions must run in under 10 minutes on 8xH100s, but we strongly recommend testing and running experiments on cheaper SKUs first, since an 8xH100 box can cost around $20/hour.
+
+3. Let's start with a 1xH100 pod. Deploy using the official Parameter Golf template: [Launch Template](https://console.runpod.io/deploy?template=y5cejece4j&ref=nl2r56th). Enable SSH terminal access, leaving the other settings at their defaults. Deploy your pod and SSH into it once it's up. You should land in `/workspace/`.
+
+On your remote machine, clone the repo onto local disk. All Python dependencies are already pre-installed in the image.
+
+```bash
+cd /workspace
+git clone https://github.com/openai/parameter-golf.git
+cd parameter-golf
+```
+
+Download our cached version of FineWeb. We'll use the 1024-token vocabulary for now.
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024
+```
+
+This defaults to the full validation split plus 80 training shards (8B tokens). If you only want a smaller subset while iterating, pass `--train-shards N`, for example `--train-shards 1`.
+
+Launch your first training run. Note that we're passing `nproc_per_node=1` because we're running on a single H100 GPU in this case.
+
+```bash
+RUN_ID=baseline_sp1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+By default, `train_gpt.py` keeps its ~10 minute wallclock cap. If you want a longer run, override it explicitly, for example `MAX_WALLCLOCK_SECONDS=0`.
+
+By default, this command prints `train_loss` step logs during training and prints `val_loss`, `val_bpb`, and compressed model size in the final `final_int8_zlib_roundtrip` lines at the end. If you want periodic validation logs during the run, set `VAL_LOSS_EVERY`, for example `VAL_LOSS_EVERY=200`. For the baseline config, the final `val_bpb` should land around ~1.2 with a compressed model size under 16MB.
+
+For dataset export, tokenizer export, and docs-cache rebuild instructions, see [data/README.md](data/README.md).
+
+
+## FAQ
+
+**What exactly counts toward the 16MB artifact size?**
+
+The submission artifact is computed as code bytes plus compressed model bytes. All counted code should live in the `train_gpt.py` script.
+The cap is decimal 16MB, i.e. 16,000,000 total bytes, not 16 MiB / 16,777,216 bytes.
+No external downloads, training dataset access, or network calls are allowed during evaluation. The artifact must be fully self-contained and reproducible.
+
+**Are scores independently verified by OpenAI?**
+
+We're not automatically verifying every submission, but we will verify the top leaderboard entries over time. Any non-reproducible results can be disqualified, and issues reproducing submissions should be raised on the PR.
+
+**What counts as 'external compute'? For example, is it fair to tune my hyperparameters offline?**
+
+There's no perfectly clear answer here and it's hard to draw a clean line around what does or does not count as external compute. For now, we're reserving the right to disqualify runs that are not in the spirit of the challenge. Tuning your Adam hyperparameters across a bunch of runs is fine, but if there's evidence that you're sneaking in additional compute unfairly, such as brute-forcing ridiculous seeds, we won't allow it. Use your best judgment and there's no penalty for asking questions.
+
+**What are the restrictions on evaluation?**
+
+We won't accept submissions that take more than 10 minutes on 8xH100 to evaluate, but otherwise you're free to evaluate however. As with modded-nanogpt, we allow evaluation at any sequence length. And, obviously, you aren't allowed to access any training data during evaluation, unless you pay for those bits in the <16MB limit. We encourage competitors to push the bounds of evaluation methods as aggressively as with training methods.
+
+## Submission Process
+
+New SOTA records must fulfill the following criteria:
+
+1. They must beat the existing SOTA by at least 0.005 nats. As in modded-nanogpt, because of inter-run variance all submissions must provide enough run logs to show at `p < 0.01` that they achieved the required 0.005-nat improvement. For submissions that improve speed through systems optimization without changing the ML, this requirement is waived.
+
+2. If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully, since bugs may unjustly improve your score.
+
+3. Reproducibly run in under 10 minutes on 8xH100s.
+
+All submissions should be made as a pull request that only adds a new folder to the appropriate `/records` subfolder and includes the following files. Submissions without the full set of requirements will not be accepted.
+
+1. A README.md file that explains the submission in reasonable detail.
+
+2. A `submission.json` file (see the example runs) that includes your name, GitHub ID, `val_bpb`, and related metadata.
+
+3. A train log, automatically produced by your script.
+
+4. A `train_gpt.py` script and any other dependencies. Note: this must successfully compile and run within the records folder. Broken scripts will not be accepted.
+
+### Non-record Submissions
+
+Submissions are also open to unique and interesting approaches that might not beat the existing SOTA, but still satisfy the 16MB artifact limit. We strongly encourage participants to submit implementations for weird or out-of-the-box ideas, in-progress or unoptimized solutions, so long as they run successfully, or even interesting negative results. We're excited to see what you come up with. We'll still maintain a high bar for non-record submissions, so be sure to justify your ideas and results in detail when submitting.
+
+We also accept non-record submissions to an unlimited compute track for runs that are not intended to meet the 10-minute cutoff. Just note as such in your README file.
+
+Non-record submissions should be made in the same fashion as SOTA records, as described above.
+
+#### PRs on Core Code
+
+The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but the best models should stay in the `/records` folder.
+
+## Support
+
+
+Join the [OpenAI Discord server](https://discord.com/invite/openai) and visit the Parameter Golf channels (#parameter-golf-discussions, #parameter-golf-announcements) and ask questions.
+
+This repository adapts code from `modded-nanogpt`, see [THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md) for attribution.
diff --git a/THIRD_PARTY_NOTICES.md b/THIRD_PARTY_NOTICES.md
new file mode 100644
index 0000000..ae833d3
--- /dev/null
+++ b/THIRD_PARTY_NOTICES.md
@@ -0,0 +1,29 @@
+# Third-Party Notices
+
+## modded-nanogpt
+
+Parts of this repository were adapted from [KellerJordan/modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt), which is licensed under the MIT License.
+
+Original source: https://github.com/KellerJordan/modded-nanogpt
+
+MIT License
+
+Copyright (c) 2024 Keller Jordan
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/data/README.md b/data/README.md
new file mode 100644
index 0000000..e1920ad
--- /dev/null
+++ b/data/README.md
@@ -0,0 +1,66 @@
+# Data Workflows
+
+This directory contains the dataset download helpers and export scripts used for the challenge.
+
+Canonical local layout:
+- `data/datasets/<dataset_name>/`
+- `data/tokenizers/`
+- `data/manifest.json`
+- `data/docs_selected.jsonl`
+- `data/docs_selected.source_manifest.json`
+
+## Downloading Published Data
+
+Download the cached FineWeb export for a tokenizer variant with:
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024
+```
+
+This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
+By default it downloads the full validation split and 8B training tokens (80 train shards).
+
+To fetch more training shards, pass `--train-shards`:
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
+```
+
+The downloader is manifest-driven and can fetch only a prefix of train shards from a larger published export. With the current shard size of `100_000_000` tokens, `10B` retokenized training tokens is `100` train shards:
+
+```bash
+MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
+MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
+```
+
+Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first `N` train shards means training on the prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family.
+
+The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.
+
+## Rebuilding Tokenizers From Published Docs
+
+To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:
+
+```bash
+python3 data/download_hf_docs_and_tokenize.py \
+  --repo-id your-hf-username/your-dataset-repo \
+  --remote-root your_50B_export_root \
+  --output-root /tmp/my_custom_tokenizer_export \
+  --tokenizer-config ./data/tokenizer_specs.json
+```
+
+The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.
+
+## Useful Knobs
+
+For CPU-heavy exports, useful knobs are:
+
+```bash
+MATCHED_FINEWEB_SP_BATCH_SIZE=2048
+MATCHED_FINEWEB_TOKENIZER_THREADS=16
+MATCHED_FINEWEB_TIKTOKEN_THREADS=16
+MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
+```
+
+These control batched tokenizer encoding during shard export, tokenizer thread count, tiktoken thread count, and batched GPT-2 decode for the blobstore docs-cache path.
diff --git a/data/cached_challenge_fineweb.py b/data/cached_challenge_fineweb.py
new file mode 100644
index 0000000..fa8029b
--- /dev/null
+++ b/data/cached_challenge_fineweb.py
@@ -0,0 +1,157 @@
+import argparse
+import json
+import os
+import shutil
+from pathlib import Path
+
+from huggingface_hub import hf_hub_download
+
+
+REPO_ID = os.environ.get("MATCHED_FINEWEB_REPO_ID", "willdepueoai/parameter-golf")
+REMOTE_ROOT_PREFIX = os.environ.get("MATCHED_FINEWEB_REMOTE_ROOT_PREFIX", "datasets")
+ROOT = Path(__file__).resolve().parent
+DATASETS_DIR = ROOT / "datasets"
+TOKENIZERS_DIR = ROOT / "tokenizers"
+
+def dataset_dir_for_variant(name: str) -> str:
+    if name == "byte260":
+        return "fineweb10B_byte260"
+    if name.startswith("sp") and name[2:].isdigit():
+        return f"fineweb10B_{name}"
+    raise ValueError(f"unsupported variant {name!r}; expected byte260 or sp<VOCAB_SIZE>")
+
+
+def local_path_for_remote(relative_path: str) -> Path:
+    remote_path = Path(relative_path)
+    if REMOTE_ROOT_PREFIX and remote_path.parts[:1] == (REMOTE_ROOT_PREFIX,):
+        remote_path = remote_path.relative_to(REMOTE_ROOT_PREFIX)
+    if remote_path.parts[:1] == ("datasets",):
+        return DATASETS_DIR.joinpath(*remote_path.parts[1:])
+    if remote_path.parts[:1] == ("tokenizers",):
+        return TOKENIZERS_DIR.joinpath(*remote_path.parts[1:])
+    return ROOT / remote_path
+
+
+def get(relative_path: str) -> None:
+    destination = local_path_for_remote(relative_path)
+    if destination.exists():
+        return
+    if destination.is_symlink():
+        destination.unlink()
+
+    remote_path = Path(relative_path)
+    cached_path = Path(
+        hf_hub_download(
+            repo_id=REPO_ID,
+            filename=remote_path.name,
+            subfolder=remote_path.parent.as_posix() if remote_path.parent != Path(".") else None,
+            repo_type="dataset",
+        )
+    )
+    # HF cache entries may be snapshot symlinks. Resolve to the underlying blob so we
+    # always materialize a real file in data/, not a broken relative symlink.
+    cached_source = cached_path.resolve(strict=True)
+    destination.parent.mkdir(parents=True, exist_ok=True)
+    try:
+        os.link(cached_source, destination)
+    except OSError:
+        shutil.copy2(cached_source, destination)
+
+
+def manifest_path() -> Path:
+    return local_path_for_remote(f"{REMOTE_ROOT_PREFIX}/manifest.json")
+
+
+def load_manifest(*, skip_manifest_download: bool) -> dict:
+    path = manifest_path()
+    if not path.is_file():
+        if skip_manifest_download:
+            raise FileNotFoundError(
+                f"manifest.json is required for manifest-driven shard counts but is not present locally at {path}"
+            )
+        get(f"{REMOTE_ROOT_PREFIX}/manifest.json")
+    return json.loads(path.read_text(encoding="utf-8"))
+
+
+def artifact_paths_for_tokenizer(tokenizer_entry: dict) -> list[str]:
+    artifacts = []
+    for key in ("model_path", "vocab_path", "path"):
+        value = tokenizer_entry.get(key)
+        if value:
+            artifacts.append(str(value))
+    if not artifacts:
+        raise ValueError(f"tokenizer entry is missing downloadable artifacts: {tokenizer_entry}")
+    return artifacts
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Download challenge FineWeb shards from Hugging Face")
+    parser.add_argument(
+        "train_shards_positional",
+        nargs="?",
+        type=int,
+        default=None,
+        help=argparse.SUPPRESS,
+    )
+    parser.add_argument(
+        "--train-shards",
+        type=int,
+        default=80,
+        help="Number of training shards to download for the selected variant. Defaults to 80.",
+    )
+    parser.add_argument(
+        "--variant",
+        default="sp1024",
+        help="Tokenizer family to download, for example sp1024, sp4096, or byte260.",
+    )
+    parser.add_argument(
+        "--skip-manifest",
+        action="store_true",
+        help="Skip downloading manifest.json.",
+    )
+    parser.add_argument(
+        "--with-docs",
+        action="store_true",
+        help="Also download docs_selected.jsonl and its sidecar for tokenizer retraining or dataset re-export.",
+    )
+    return parser
+
+
+def main() -> None:
+    args = build_parser().parse_args()
+    dataset_dir = dataset_dir_for_variant(args.variant)
+    train_shards = args.train_shards_positional if args.train_shards_positional is not None else args.train_shards
+    if train_shards < 0:
+        raise ValueError("train_shards must be non-negative")
+
+    manifest = load_manifest(skip_manifest_download=args.skip_manifest)
+    dataset_entry = next((x for x in manifest.get("datasets", []) if x.get("name") == dataset_dir), None)
+    if dataset_entry is None:
+        raise ValueError(f"dataset {dataset_dir} not found in {REMOTE_ROOT_PREFIX}/manifest.json")
+    max_train_shards = int((dataset_entry.get("stats") or {}).get("files_train"))
+    val_shards = int((dataset_entry.get("stats") or {}).get("files_val"))
+    if train_shards > max_train_shards:
+        raise ValueError(
+            f"{args.variant} only has {max_train_shards} training shards on {REPO_ID}, requested {train_shards}"
+        )
+    tokenizer_name = dataset_entry.get("tokenizer_name")
+    tokenizer_entry = next((x for x in manifest.get("tokenizers", []) if x.get("name") == tokenizer_name), None)
+    if tokenizer_entry is None:
+        raise ValueError(f"tokenizer {tokenizer_name} not found in {REMOTE_ROOT_PREFIX}/manifest.json")
+
+    if args.with_docs:
+        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.jsonl")
+        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.source_manifest.json")
+
+    dataset_prefix = f"{REMOTE_ROOT_PREFIX}/datasets/{dataset_dir}"
+    for i in range(val_shards):
+        get(f"{dataset_prefix}/fineweb_val_{i:06d}.bin")
+    for i in range(train_shards):
+        get(f"{dataset_prefix}/fineweb_train_{i:06d}.bin")
+
+    for artifact_path in artifact_paths_for_tokenizer(tokenizer_entry):
+        get(f"{REMOTE_ROOT_PREFIX}/{artifact_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/data/download_hf_docs_and_tokenize.py b/data/download_hf_docs_and_tokenize.py
new file mode 100644
index 0000000..dcabd40
--- /dev/null
+++ b/data/download_hf_docs_and_tokenize.py
@@ -0,0 +1,627 @@
+"""Download docs_selected.jsonl from Hugging Face and tokenize it locally.
+
+This script is standalone. It does not import any local exporter or tokenizer
+helpers. Tokenizer configs are JSON only and currently support the built-in
+pure-byte and SentencePiece tokenizer definitions in `data/tokenizer_specs.json`.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import shutil
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+from huggingface_hub import hf_hub_download
+from huggingface_hub.utils import EntryNotFoundError
+
+
+DOCS_FILENAME = "docs_selected.jsonl"
+SIDECAR_FILENAME = "docs_selected.source_manifest.json"
+VERSION = "10B"
+NUM_VAL_DOCS = 50_000
+SHARD_SIZE = 10**8
+APPEND_EOS = False
+DATAFILE_MAGIC = 20240520
+DATAFILE_VERSION = 1
+DEFAULT_REPO_ID = os.environ.get("MATCHED_FINEWEB_REPO_ID", "willdepueoai/parameter-golf")
+DEFAULT_REMOTE_ROOT = os.environ.get("MATCHED_FINEWEB_REMOTE_ROOT_PREFIX", "datasets")
+DEFAULT_CONFIG = Path(__file__).with_name("tokenizer_specs.json")
+TOKENIZER_THREADS = max(1, int(os.environ.get("MATCHED_FINEWEB_TOKENIZER_THREADS", str(os.cpu_count() or 8))))
+SP_BATCH_SIZE = max(1, int(os.environ.get("MATCHED_FINEWEB_SP_BATCH_SIZE", "1024")))
+
+
+@dataclass(frozen=True)
+class PureByteTokenizer:
+    pad_id: int = 0
+    bos_id: int = 1
+    eos_id: int = 2
+    unk_id: int = 3
+    byte_offset: int = 4
+    byte_count: int = 256
+
+    @property
+    def vocab_size(self) -> int:
+        return self.byte_offset + self.byte_count
+
+    def encode(self, text: str) -> np.ndarray:
+        data = text.encode("utf-8", errors="replace")
+        return np.frombuffer(data, dtype=np.uint8).astype(np.uint16, copy=False) + self.byte_offset
+
+    def encode_batch(self, texts: list[str]) -> list[np.ndarray]:
+        return [self.encode(text) for text in texts]
+
+    def save_json(self, path: str | Path) -> None:
+        path = Path(path)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        payload = {
+            "tokenizer_type": "pure_byte",
+            "config": asdict(self),
+            "vocab_size": self.vocab_size,
+        }
+        path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+
+
+def default_pure_byte_tokenizer() -> PureByteTokenizer:
+    return PureByteTokenizer()
+
+
+def docs_sidecar_path(docs_jsonl: Path) -> Path:
+    return docs_jsonl.with_name(f"{docs_jsonl.stem}.source_manifest.json")
+
+
+def maybe_load_docs_sidecar_meta(docs_jsonl: Path) -> dict[str, Any] | None:
+    sidecar_path = docs_sidecar_path(docs_jsonl)
+    if not sidecar_path.is_file():
+        return None
+    payload = json.loads(sidecar_path.read_text(encoding="utf-8"))
+    if not isinstance(payload, dict):
+        raise ValueError(f"docs sidecar must be a JSON object: {sidecar_path}")
+    return payload
+
+
+def copy_from_hf_cache(*, repo_id: str, remote_root: str, filename: str, destination: Path) -> bool:
+    remote_path = Path(remote_root) / filename if remote_root else Path(filename)
+    try:
+        cached_path = Path(
+            hf_hub_download(
+                repo_id=repo_id,
+                filename=remote_path.name,
+                subfolder=remote_path.parent.as_posix() if remote_path.parent != Path(".") else None,
+                repo_type="dataset",
+            )
+        )
+    except EntryNotFoundError:
+        return False
+
+    source = cached_path.resolve(strict=True)
+    destination.parent.mkdir(parents=True, exist_ok=True)
+    if destination.exists():
+        destination.unlink()
+    try:
+        os.link(source, destination)
+    except OSError:
+        shutil.copy2(source, destination)
+    return True
+
+
+def iter_docs(path: Path):
+    with path.open("r", encoding="utf-8") as f:
+        for line in f:
+            yield json.loads(line)["text"]
+
+
+def count_docs(path: Path) -> int:
+    with path.open("r", encoding="utf-8") as f:
+        return sum(1 for _ in f)
+
+
+def batched_docs_jsonl(path: Path, batch_size: int):
+    batch: list[str] = []
+    for text in iter_docs(path):
+        batch.append(text)
+        if len(batch) == batch_size:
+            yield batch
+            batch = []
+    if batch:
+        yield batch
+
+
+def write_datafile(path: Path, toks: Any) -> None:
+    if len(toks) >= 2**31:
+        raise ValueError("token count too large")
+    header = np.zeros(256, dtype="<i4")
+    header[0] = DATAFILE_MAGIC
+    header[1] = DATAFILE_VERSION
+    header[2] = len(toks)
+    toks = np.asarray(toks)
+    if toks.dtype != np.uint16:
+        if not ((0 <= toks).all() and (toks < 2**16).all()):
+            raise ValueError("token dictionary too large for uint16")
+        toks = toks.astype("<u2", copy=False)
+    else:
+        toks = toks.astype("<u2", copy=False)
+    with path.open("wb") as f:
+        f.write(header.tobytes())
+        f.write(toks.tobytes())
+
+
+def relativize_manifest_paths(value: Any, root: Path) -> Any:
+    if isinstance(value, dict):
+        return {k: relativize_manifest_paths(v, root) for k, v in value.items()}
+    if isinstance(value, list):
+        return [relativize_manifest_paths(v, root) for v in value]
+    if isinstance(value, str):
+        path = Path(value)
+        if path.is_absolute():
+            try:
+                return path.relative_to(root).as_posix()
+            except ValueError:
+                return value
+    return value
+
+
+def parse_reuse_sp_models(values: list[str]) -> dict[int, Path]:
+    reuse_models: dict[int, Path] = {}
+    for value in values:
+        vocab_size_str, model_path = value.split("=", 1)
+        vocab_size = int(vocab_size_str)
+        if vocab_size in reuse_models:
+            raise ValueError(f"duplicate --reuse_sp_model for vocab_size={vocab_size}")
+        reuse_models[vocab_size] = Path(model_path).expanduser().resolve()
+    return reuse_models
+
+
+def load_specs(config_path: Path) -> list[dict[str, Any]]:
+    payload = json.loads(config_path.read_text(encoding="utf-8"))
+    if isinstance(payload, dict):
+        specs = payload.get("tokenizer_specs", payload.get("tokenizers"))
+    else:
+        specs = payload
+    if not isinstance(specs, list) or not specs:
+        raise ValueError("tokenizer_config must define a non-empty list")
+    if not all(isinstance(spec, dict) for spec in specs):
+        raise ValueError("each tokenizer spec must be a JSON object")
+    return [dict(spec) for spec in specs]
+
+
+def tokenizer_kind(spec: dict[str, Any]) -> str:
+    kind = spec.get("kind")
+    if kind in {"byte", "pure_byte"}:
+        return "byte"
+    if kind in {"sentencepiece_bpe", "sentencepiece"}:
+        return "sentencepiece_bpe"
+    builder = str(spec.get("builder", ""))
+    builder_name = builder.rsplit(":", 1)[-1]
+    if builder_name == "build_pure_byte_tokenizer":
+        return "byte"
+    if builder_name == "build_sentencepiece_tokenizer":
+        return "sentencepiece_bpe"
+    if spec.get("dataset_suffix") == "byte260":
+        return "byte"
+    if "vocab_size" in spec:
+        return "sentencepiece_bpe"
+    raise ValueError(
+        f"unsupported tokenizer spec {spec.get('name', '<unnamed>')!r}: "
+        "expected a built-in pure-byte or sentencepiece builder"
+    )
+
+
+def write_tokenizer_config_export(output_root: Path, selected_specs: list[dict[str, Any]]) -> Path:
+    path = output_root / "tokenizer_config.export.json"
+    path.write_text(json.dumps({"tokenizers": selected_specs}, indent=2) + "\n", encoding="utf-8")
+    return path
+
+
+def _iter_sentencepiece_text(docs_jsonl: Path, *, max_docs: int | None = None):
+    with docs_jsonl.open("r", encoding="utf-8") as f:
+        for i, line in enumerate(f):
+            if max_docs is not None and i >= max_docs:
+                break
+            text = json.loads(line)["text"].replace("\x00", " ").strip()
+            if text:
+                yield text
+
+
+def build_pure_byte_tokenizer(*, spec: dict[str, Any], docs_jsonl: Path, tokenizers_dir: Path) -> dict[str, Any]:
+    del docs_jsonl
+    tok = default_pure_byte_tokenizer()
+    path = tokenizers_dir / spec.get("filename", "fineweb_pure_byte_260.json")
+    tok.save_json(path)
+    return {
+        "name": spec.get("name", "pure_byte_260"),
+        "kind": "byte",
+        "dataset_suffix": spec.get("dataset_suffix", "byte260"),
+        "vocab_size": tok.vocab_size,
+        "bos_id": tok.bos_id,
+        "eos_id": tok.eos_id,
+        "encode": tok.encode,
+        "encode_batch": tok.encode_batch,
+        "manifest": {"path": str(path), "pad_id": tok.pad_id, "unk_id": tok.unk_id},
+    }
+
+
+def build_sentencepiece_tokenizer(*, spec: dict[str, Any], docs_jsonl: Path, tokenizers_dir: Path) -> dict[str, Any]:
+    try:
+        import sentencepiece as spm
+    except ImportError as exc:
+        raise RuntimeError("sentencepiece is required for SentencePiece tokenizer exports") from exc
+
+    vocab_size = int(spec["vocab_size"])
+    prefix = tokenizers_dir / spec.get("model_prefix", f"fineweb_{vocab_size}_bpe")
+    model_path = prefix.with_suffix(".model")
+    vocab_path = prefix.with_suffix(".vocab")
+    prefix.parent.mkdir(parents=True, exist_ok=True)
+    for artifact in (model_path, vocab_path):
+        if artifact.exists():
+            artifact.unlink()
+
+    reuse_model_path = spec.get("reuse_model_path")
+    if reuse_model_path is not None:
+        reuse_model_path = Path(reuse_model_path).expanduser().resolve()
+        if not reuse_model_path.is_file():
+            raise FileNotFoundError(reuse_model_path)
+        shutil.copy2(reuse_model_path, model_path)
+        reuse_vocab_path = reuse_model_path.with_suffix(".vocab")
+        if reuse_vocab_path.is_file():
+            shutil.copy2(reuse_vocab_path, vocab_path)
+    else:
+        kwargs = {
+            "sentence_iterator": _iter_sentencepiece_text(
+                docs_jsonl,
+                max_docs=None if spec.get("tokenizer_train_docs") is None else int(spec["tokenizer_train_docs"]),
+            ),
+            "model_prefix": str(prefix),
+            "model_type": "bpe",
+            "vocab_size": vocab_size,
+            "character_coverage": 0.999,
+            "byte_fallback": True,
+            "split_digits": True,
+            "normalization_rule_name": "nmt_nfkc",
+            "add_dummy_prefix": False,
+            "pad_id": 0,
+            "bos_id": 1,
+            "eos_id": 2,
+            "unk_id": 3,
+            "hard_vocab_limit": False,
+        }
+        kwargs.update(spec.get("trainer_overrides") or {})
+        spm.SentencePieceTrainer.train(**kwargs)
+
+    tok = spm.SentencePieceProcessor(model_file=str(model_path))
+    return {
+        "name": spec.get("name", f"sp_bpe_{vocab_size}"),
+        "kind": "sentencepiece_bpe",
+        "dataset_suffix": spec.get("dataset_suffix", f"sp{vocab_size}"),
+        "vocab_size": int(tok.vocab_size()),
+        "bos_id": int(tok.bos_id()),
+        "eos_id": int(tok.eos_id()),
+        "encode": lambda text, tok=tok: tok.encode(text, out_type=int),
+        "encode_batch": lambda texts, tok=tok: tok.encode(texts, out_type=int, num_threads=TOKENIZER_THREADS),
+        "manifest": {"model_path": str(model_path), "vocab_path": str(vocab_path)},
+    }
+
+
+def export_shards(
+    docs_jsonl: Path,
+    tok: dict[str, Any],
+    output_dir: Path,
+    *,
+    num_val_docs: int,
+    shard_size: int,
+    docs_total: int,
+) -> dict[str, int]:
+    output_dir.mkdir(parents=True, exist_ok=True)
+    for pattern in ("fineweb_train_*.bin", "fineweb_val_*.bin"):
+        for stale in output_dir.glob(pattern):
+            stale.unlink()
+
+    stats = {
+        "docs_total": 0,
+        "docs_val": 0,
+        "docs_train": 0,
+        "files_total": 0,
+        "files_val": 0,
+        "files_train": 0,
+        "tokens_total": 0,
+        "tokens_val": 0,
+        "tokens_train": 0,
+    }
+    buf = np.empty((shard_size,), dtype=np.uint16)
+    fill = 0
+    split = "val"
+    shards = {"val": 0, "train": 0}
+
+    def flush() -> None:
+        nonlocal fill
+        if fill == 0:
+            return
+        write_datafile(output_dir / f"fineweb_{split}_{shards[split]:06d}.bin", buf[:fill])
+        stats["files_total"] += 1
+        stats[f"files_{split}"] += 1
+        shards[split] += 1
+        fill = 0
+
+    vocab_size = int(tok["vocab_size"])
+    if vocab_size > 2**16:
+        raise ValueError(f"vocab_size={vocab_size} is too large for uint16 shard storage")
+
+    batch_encode = tok.get("encode_batch")
+    batch_size = SP_BATCH_SIZE if callable(batch_encode) else 1
+    for texts in batched_docs_jsonl(docs_jsonl, batch_size):
+        encoded_docs = batch_encode(texts) if callable(batch_encode) else [tok["encode"](text) for text in texts]
+        for text, encoded in zip(texts, encoded_docs, strict=True):
+            del text
+            split_for_doc = "val" if stats["docs_total"] < num_val_docs else "train"
+            if split_for_doc != split:
+                flush()
+                split = split_for_doc
+
+            encoded_arr = np.asarray(encoded, dtype=np.int32)
+            toks = np.empty((encoded_arr.size + 1 + int(APPEND_EOS),), dtype=np.int32)
+            toks[0] = tok["bos_id"]
+            toks[1 : 1 + encoded_arr.size] = encoded_arr
+            if APPEND_EOS:
+                toks[-1] = tok["eos_id"]
+            if not ((0 <= toks).all() and (toks < vocab_size).all()):
+                bad = int(toks[(toks < 0) | (toks >= vocab_size)][0])
+                raise ValueError(f"token id {bad} outside declared vocab_size={vocab_size}")
+            toks = toks.astype("<u2", copy=False)
+
+            stats["docs_total"] += 1
+            stats[f"docs_{split}"] += 1
+            stats["tokens_total"] += len(toks)
+            stats[f"tokens_{split}"] += len(toks)
+
+            pos = 0
+            while pos < len(toks):
+                take = min(shard_size - fill, len(toks) - pos)
+                buf[fill : fill + take] = toks[pos : pos + take]
+                fill += take
+                pos += take
+                if fill == shard_size:
+                    flush()
+
+        if stats["docs_total"] and stats["docs_total"] % 100_000 == 0:
+            print(f"{output_dir.name}: {stats['docs_total']}/{docs_total} docs", flush=True)
+
+    flush()
+    if stats["docs_total"] != docs_total:
+        raise ValueError(f"expected {docs_total} docs, exported {stats['docs_total']}")
+    return stats
+
+
+def build_tokenizers(
+    *,
+    specs: list[dict[str, Any]],
+    docs_jsonl: Path,
+    tokenizers_dir: Path,
+    tokenizer_train_docs: int | None,
+    skip_byte: bool,
+    reuse_sp_models: dict[int, Path],
+) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
+    tokenizers: list[dict[str, Any]] = []
+    selected_specs: list[dict[str, Any]] = []
+    seen_names: set[str] = set()
+    seen_datasets: set[str] = set()
+
+    for raw_spec in specs:
+        spec = dict(raw_spec)
+        kind = tokenizer_kind(spec)
+        if skip_byte and kind == "byte":
+            continue
+        if kind == "sentencepiece_bpe":
+            if tokenizer_train_docs is not None:
+                spec["tokenizer_train_docs"] = int(tokenizer_train_docs)
+            vocab_size = int(spec["vocab_size"])
+            if vocab_size in reuse_sp_models:
+                spec["reuse_model_path"] = str(reuse_sp_models[vocab_size])
+
+        selected_specs.append(spec)
+        built = (
+            build_pure_byte_tokenizer(spec=spec, docs_jsonl=docs_jsonl, tokenizers_dir=tokenizers_dir)
+            if kind == "byte"
+            else build_sentencepiece_tokenizer(spec=spec, docs_jsonl=docs_jsonl, tokenizers_dir=tokenizers_dir)
+        )
+        name = str(built["name"])
+        dataset_suffix = built.get("dataset_suffix")
+        dataset_name = str(built.get("dataset_name", f"fineweb{VERSION}_{dataset_suffix}"))
+        if name in seen_names:
+            raise ValueError(f"duplicate tokenizer name: {name}")
+        if dataset_name in seen_datasets:
+            raise ValueError(f"duplicate dataset name: {dataset_name}")
+        seen_names.add(name)
+        seen_datasets.add(dataset_name)
+        vocab_size = int(built["vocab_size"])
+        recommended_bigram_vocab_size = int(
+            built.get("recommended_bigram_vocab_size", ((vocab_size + 127) // 128) * 128 * 5)
+        )
+        tokenizers.append(
+            {
+                "name": name,
+                "kind": str(built["kind"]),
+                "dataset_name": dataset_name,
+                "vocab_size": vocab_size,
+                "bos_id": int(built["bos_id"]),
+                "eos_id": int(built["eos_id"]),
+                "encode": built["encode"],
+                "encode_batch": built.get("encode_batch"),
+                "recommended_bigram_vocab_size": recommended_bigram_vocab_size,
+                "manifest": {
+                    "name": name,
+                    "kind": str(built["kind"]),
+                    "vocab_size": vocab_size,
+                    "bos_id": int(built["bos_id"]),
+                    "eos_id": int(built["eos_id"]),
+                    "recommended_bigram_vocab_size": recommended_bigram_vocab_size,
+                    "source_spec": spec,
+                    **(built.get("manifest") or {}),
+                },
+            }
+        )
+    if not tokenizers:
+        raise ValueError("tokenizer_config produced no tokenizers after filtering")
+    return tokenizers, selected_specs
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        description="Download docs_selected.jsonl from a Hugging Face dataset repo and tokenize it locally"
+    )
+    parser.add_argument(
+        "--repo-id",
+        default=DEFAULT_REPO_ID,
+        help="Hugging Face dataset repo id, for example user/dataset",
+    )
+    parser.add_argument(
+        "--remote-root",
+        default=DEFAULT_REMOTE_ROOT,
+        help="Optional subdirectory inside the dataset repo that contains docs_selected.jsonl",
+    )
+    parser.add_argument("--output-root", required=True, help="Directory where docs, tokenizers, shards, and manifest are written")
+    parser.add_argument(
+        "--tokenizer-config",
+        default=str(DEFAULT_CONFIG),
+        help="Local tokenizer config JSON. Defaults to data/tokenizer_specs.json.",
+    )
+    parser.add_argument(
+        "--num-val-docs",
+        type=int,
+        default=None,
+        help="Validation document count. Defaults to the downloaded sidecar when present, otherwise 50000.",
+    )
+    parser.add_argument("--chunk-tokens", type=int, default=SHARD_SIZE, help="Shard size in tokens.")
+    parser.add_argument(
+        "--tokenizer-train-docs",
+        type=int,
+        default=None,
+        help="Limit the number of docs used for tokenizer training.",
+    )
+    parser.add_argument("--skip-byte", action="store_true", help="Skip byte-tokenizer export.")
+    parser.add_argument(
+        "--reuse-sp-model",
+        action="append",
+        default=[],
+        metavar="VOCAB=MODEL",
+        help="Reuse an existing SentencePiece model for the given vocab size instead of retraining it.",
+    )
+    return parser
+
+
+def main() -> None:
+    args = build_parser().parse_args()
+    if args.chunk_tokens <= 0:
+        raise ValueError(f"--chunk_tokens must be positive, got {args.chunk_tokens}")
+
+    output_root = Path(args.output_root).expanduser().resolve()
+    output_root.mkdir(parents=True, exist_ok=True)
+    tokenizers_dir = output_root / "tokenizers"
+    datasets_dir = output_root / "datasets"
+    tokenizers_dir.mkdir(parents=True, exist_ok=True)
+    datasets_dir.mkdir(parents=True, exist_ok=True)
+
+    docs_jsonl = output_root / DOCS_FILENAME
+    sidecar = output_root / SIDECAR_FILENAME
+    if not copy_from_hf_cache(
+        repo_id=args.repo_id,
+        remote_root=args.remote_root,
+        filename=DOCS_FILENAME,
+        destination=docs_jsonl,
+    ):
+        remote = f"{args.remote_root}/{DOCS_FILENAME}" if args.remote_root else DOCS_FILENAME
+        raise FileNotFoundError(f"{remote} not found in Hugging Face dataset repo {args.repo_id}")
+    if not copy_from_hf_cache(
+        repo_id=args.repo_id,
+        remote_root=args.remote_root,
+        filename=SIDECAR_FILENAME,
+        destination=sidecar,
+    ):
+        sidecar.unlink(missing_ok=True)
+
+    docs_sidecar = maybe_load_docs_sidecar_meta(docs_jsonl)
+    docs_total = int(docs_sidecar["num_docs"]) if docs_sidecar is not None and docs_sidecar.get("num_docs") is not None else count_docs(docs_jsonl)
+    if args.num_val_docs is not None:
+        num_val_docs = int(args.num_val_docs)
+    elif docs_sidecar is not None and docs_sidecar.get("docs_val") is not None:
+        num_val_docs = int(docs_sidecar["docs_val"])
+    else:
+        num_val_docs = NUM_VAL_DOCS
+    if not (0 <= num_val_docs <= docs_total):
+        raise ValueError(f"num_val_docs must be in [0, {docs_total}], got {num_val_docs}")
+
+    specs = load_specs(Path(args.tokenizer_config).expanduser().resolve())
+    reuse_sp_models = parse_reuse_sp_models(args.reuse_sp_model)
+    tokenizers, selected_specs = build_tokenizers(
+        specs=specs,
+        docs_jsonl=docs_jsonl,
+        tokenizers_dir=tokenizers_dir,
+        tokenizer_train_docs=args.tokenizer_train_docs,
+        skip_byte=args.skip_byte,
+        reuse_sp_models=reuse_sp_models,
+    )
+    write_tokenizer_config_export(output_root, selected_specs)
+
+    docs_meta = {
+        "remote_repo_id": args.repo_id,
+        "remote_root": args.remote_root,
+        "num_docs": docs_total,
+        "docs_sha256": None if docs_sidecar is None else docs_sidecar.get("docs_sha256"),
+        "source_manifest": str(docs_sidecar_path(docs_jsonl)) if docs_sidecar is not None else None,
+    }
+    if docs_sidecar is not None:
+        docs_meta["source_sidecar"] = docs_sidecar
+
+    manifest = {
+        "version": VERSION,
+        "num_docs": docs_total,
+        "num_val_docs": num_val_docs,
+        "shuffle_seed": None if docs_sidecar is None else docs_sidecar.get("shuffle_seed"),
+        "shard_size": int(args.chunk_tokens),
+        "append_eos": APPEND_EOS,
+        "docs_jsonl": str(docs_jsonl),
+        "docs_meta": docs_meta,
+        "tokenizer_specs": selected_specs,
+        "tokenizers": [],
+        "datasets": [],
+    }
+
+    for tok in tokenizers:
+        output_dir = datasets_dir / tok["dataset_name"]
+        print(f"Exporting dataset: {tok['dataset_name']}", flush=True)
+        stats = export_shards(
+            docs_jsonl,
+            tok,
+            output_dir,
+            num_val_docs=num_val_docs,
+            shard_size=int(args.chunk_tokens),
+            docs_total=docs_total,
+        )
+        manifest["tokenizers"].append(tok["manifest"])
+        manifest["datasets"].append(
+            {
+                "name": tok["dataset_name"],
+                "tokenizer_name": tok["name"],
+                "tokenizer_kind": tok["kind"],
+                "path": str(output_dir),
+                "train_glob": str(output_dir / "fineweb_train_*.bin"),
+                "val_glob": str(output_dir / "fineweb_val_*.bin"),
+                "vocab_size": tok["vocab_size"],
+                "bos_id": tok["bos_id"],
+                "eos_id": tok["eos_id"],
+                "recommended_bigram_vocab_size": tok["recommended_bigram_vocab_size"],
+                "stats": stats,
+            }
+        )
+
+    manifest = relativize_manifest_paths(manifest, output_root)
+    manifest_path = output_root / "manifest.json"
+    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
+    print(f"Done. Manifest: {manifest_path}", flush=True)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/data/tokenizer_specs.json b/data/tokenizer_specs.json
new file mode 100644
index 0000000..d7ad1ca
--- /dev/null
+++ b/data/tokenizer_specs.json
@@ -0,0 +1,9 @@
+{
+  "tokenizers": [
+    {
+      "name": "sp_bpe_1024",
+      "dataset_suffix": "sp1024",
+      "vocab_size": 1024
+    }
+  ]
+}
diff --git a/records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md b/records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md
new file mode 100644
index 0000000..1eb352a
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md
@@ -0,0 +1,46 @@
+This record captures the `Simple Baseline`.
+
+Trainer changes in this snapshot:
+- current repository `train_gpt.py` snapshot copied into the record folder
+- published `fineweb10B_sp1024` dataset and tokenizer loaded from the new Hugging Face export
+- 10-minute wallclock cap on `8xH100`
+- periodic validation every `200` steps on the full `fineweb_val_*` split
+
+Configuration:
+- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
+- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
+- Tied embedding LR: `TIED_EMBED_LR=0.05`
+- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024`
+
+Command (track-relevant params):
+```bash
+NCCL_IB_DISABLE=1 \
+RUN_ID=hf_verify_sp1024_8gpu \
+DATA_PATH=/root/code/parameter-golf/data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=/root/code/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+MAX_WALLCLOCK_SECONDS=600 \
+TRAIN_LOG_EVERY=50 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=8 /root/code/parameter-golf/train_gpt.py
+```
+
+Key metrics (from `train.log`):
+- Timed training stopped at `13780/20000` steps due to the wallclock cap.
+- Pre-quant eval at stop: `val_loss:2.0606`, `val_bpb:1.2172`
+- Post-quant roundtrip eval: `val_loss:2.0727`, `val_bpb:1.2244`
+- Exact printed metric: `final_int8_zlib_roundtrip_exact val_bpb:1.22436570`
+- Train time: `600038ms` (`step_avg:43.54ms`)
+- Peak memory: `10184 MiB allocated`, `10200 MiB reserved`
+- Serialized model int8+zlib: `15815847 bytes`
+- Code size: `47642 bytes`
+- Total submission size int8+zlib: `15863489 bytes`
+
+Training volume:
+- Global batch: `524288` tokens/step
+- Total train tokens seen: `7224688640`
+
+Included files:
+- `train_gpt.py` (code snapshot used for the run)
+- `train.log` (exact remote training log)
+- `submission.json` (leaderboard metadata)
diff --git a/records/track_10min_16mb/2026-03-17_NaiveBaseline/submission.json b/records/track_10min_16mb/2026-03-17_NaiveBaseline/submission.json
new file mode 100644
index 0000000..faffa83
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-17_NaiveBaseline/submission.json
@@ -0,0 +1,11 @@
+{
+  "author": "Baseline",
+  "github_id": "openai",
+  "name": "Naive Baseline",
+  "blurb": "SP-1024 9x512 KV4 run on pgut1 using the published Hugging Face fineweb10B_sp1024 export and the current train_gpt.py; score is the default final int8+zlib roundtrip metric under the 16,000,000-byte cap.",
+  "date": "2026-03-18T14:56:29Z",
+  "val_loss": 2.07269931,
+  "val_bpb": 1.2243657,
+  "bytes_total": 15863489,
+  "bytes_code": 47642
+}
diff --git a/records/track_10min_16mb/2026-03-17_NaiveBaseline/train.log b/records/track_10min_16mb/2026-03-17_NaiveBaseline/train.log
new file mode 100644
index 0000000..69b17b6
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-17_NaiveBaseline/train.log
@@ -0,0 +1,448 @@
+W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] 
+W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] *****************************************
+W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] *****************************************
+[W318 14:38:11.514156940 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[W318 14:38:11.543417305 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[W318 14:38:11.552597211 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+NCCL version 2.27.5+cuda12.9
+[W318 14:38:11.832390267 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[W318 14:38:11.842257581 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[W318 14:38:11.842253680 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[W318 14:38:11.899166383 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[W318 14:38:11.901800020 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+
+[2026-03-18 14:38:12] pgut1-0:871784:871848 [5] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871784:871848 [5] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871786:871849 [7] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871786:871849 [7] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871779:871850 [0] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871779:871850 [0] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871780:871857 [1] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871780:871857 [1] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871781:871858 [2] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871781:871858 [2] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871783:871859 [4] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871783:871859 [4] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871782:871864 [3] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871782:871864 [3] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+
+[2026-03-18 14:38:12] pgut1-0:871785:871865 [6] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed
+
+[2026-03-18 14:38:12] pgut1-0:871785:871865 [6] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
+logs/hf_verify_sp1024_8gpu.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/root/code/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:25
+val_loader:shards pattern=/root/code/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:63779840
+[rank0]:[W318 14:38:18.833454927 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+model_params:17059912
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
+train_batch_tokens:524288 train_seq_len:1024 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
+seed:1337
+[rank3]:[W318 14:38:18.835915381 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[rank7]:[W318 14:38:18.835951425 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[rank6]:[W318 14:38:18.835967008 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[rank2]:[W318 14:38:18.836023454 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[rank5]:[W318 14:38:18.836119632 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[rank4]:[W318 14:38:18.836127772 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+[rank1]:[W318 14:38:18.836354967 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/20000 val_loss:6.9370 val_bpb:4.0978 train_time:0ms step_avg:0.01ms
+step:1/20000 train_loss:6.9408 train_time:24ms step_avg:23.99ms
+step:2/20000 train_loss:16.8763 train_time:67ms step_avg:33.39ms
+step:3/20000 train_loss:9.0044 train_time:110ms step_avg:36.62ms
+step:4/20000 train_loss:6.5686 train_time:152ms step_avg:37.99ms
+step:5/20000 train_loss:6.6665 train_time:195ms step_avg:38.97ms
+step:6/20000 train_loss:6.5027 train_time:239ms step_avg:39.81ms
+step:7/20000 train_loss:6.2808 train_time:280ms step_avg:40.05ms
+step:8/20000 train_loss:5.9951 train_time:324ms step_avg:40.52ms
+step:9/20000 train_loss:6.0187 train_time:367ms step_avg:40.77ms
+step:10/20000 train_loss:5.9718 train_time:409ms step_avg:40.93ms
+step:50/20000 train_loss:3.9508 train_time:2126ms step_avg:42.52ms
+step:100/20000 train_loss:3.3373 train_time:4267ms step_avg:42.67ms
+step:150/20000 train_loss:2.9651 train_time:6414ms step_avg:42.76ms
+step:200/20000 train_loss:2.8041 train_time:8677ms step_avg:43.38ms
+step:200/20000 val_loss:2.8397 val_bpb:1.6774 train_time:8699ms step_avg:43.49ms
+step:250/20000 train_loss:2.7379 train_time:10816ms step_avg:43.27ms
+step:300/20000 train_loss:2.6613 train_time:12958ms step_avg:43.19ms
+step:350/20000 train_loss:2.6434 train_time:15097ms step_avg:43.13ms
+step:400/20000 train_loss:2.7684 train_time:17357ms step_avg:43.39ms
+step:400/20000 val_loss:2.5687 val_bpb:1.5174 train_time:17382ms step_avg:43.45ms
+step:450/20000 train_loss:2.6035 train_time:19502ms step_avg:43.34ms
+step:500/20000 train_loss:2.5265 train_time:21643ms step_avg:43.29ms
+step:550/20000 train_loss:2.4803 train_time:23782ms step_avg:43.24ms
+step:600/20000 train_loss:2.4731 train_time:26034ms step_avg:43.39ms
+step:600/20000 val_loss:2.4456 val_bpb:1.4447 train_time:26059ms step_avg:43.43ms
+step:650/20000 train_loss:2.3204 train_time:28175ms step_avg:43.35ms
+step:700/20000 train_loss:2.5926 train_time:30315ms step_avg:43.31ms
+step:750/20000 train_loss:2.4301 train_time:32457ms step_avg:43.28ms
+step:800/20000 train_loss:2.4775 train_time:34707ms step_avg:43.38ms
+step:800/20000 val_loss:2.3868 val_bpb:1.4099 train_time:34732ms step_avg:43.42ms
+step:850/20000 train_loss:2.3941 train_time:36851ms step_avg:43.35ms
+step:900/20000 train_loss:2.3716 train_time:38990ms step_avg:43.32ms
+step:950/20000 train_loss:2.3216 train_time:41131ms step_avg:43.30ms
+step:1000/20000 train_loss:2.3030 train_time:43390ms step_avg:43.39ms
+step:1000/20000 val_loss:2.3370 val_bpb:1.3805 train_time:43415ms step_avg:43.42ms
+step:1050/20000 train_loss:2.3893 train_time:45532ms step_avg:43.36ms
+step:1100/20000 train_loss:2.4145 train_time:47675ms step_avg:43.34ms
+step:1150/20000 train_loss:2.2261 train_time:49933ms step_avg:43.42ms
+step:1200/20000 train_loss:2.2607 train_time:52072ms step_avg:43.39ms
+step:1200/20000 val_loss:2.3026 val_bpb:1.3602 train_time:52097ms step_avg:43.41ms
+step:1250/20000 train_loss:2.3312 train_time:54219ms step_avg:43.38ms
+step:1300/20000 train_loss:2.3575 train_time:56363ms step_avg:43.36ms
+step:1350/20000 train_loss:2.2774 train_time:58628ms step_avg:43.43ms
+step:1400/20000 train_loss:2.2436 train_time:60772ms step_avg:43.41ms
+step:1400/20000 val_loss:2.2812 val_bpb:1.3475 train_time:60797ms step_avg:43.43ms
+step:1450/20000 train_loss:2.3006 train_time:62917ms step_avg:43.39ms
+step:1500/20000 train_loss:2.2831 train_time:65060ms step_avg:43.37ms
+step:1550/20000 train_loss:2.2957 train_time:67324ms step_avg:43.43ms
+step:1600/20000 train_loss:2.2187 train_time:69467ms step_avg:43.42ms
+step:1600/20000 val_loss:2.2631 val_bpb:1.3368 train_time:69491ms step_avg:43.43ms
+step:1650/20000 train_loss:2.2629 train_time:71614ms step_avg:43.40ms
+step:1700/20000 train_loss:2.2619 train_time:73759ms step_avg:43.39ms
+step:1750/20000 train_loss:2.1068 train_time:76028ms step_avg:43.44ms
+step:1800/20000 train_loss:2.3312 train_time:78171ms step_avg:43.43ms
+step:1800/20000 val_loss:2.2479 val_bpb:1.3279 train_time:78197ms step_avg:43.44ms
+step:1850/20000 train_loss:2.2211 train_time:80317ms step_avg:43.41ms
+step:1900/20000 train_loss:2.2477 train_time:82462ms step_avg:43.40ms
+step:1950/20000 train_loss:2.2707 train_time:84723ms step_avg:43.45ms
+step:2000/20000 train_loss:2.2346 train_time:86867ms step_avg:43.43ms
+step:2000/20000 val_loss:2.2368 val_bpb:1.3213 train_time:86892ms step_avg:43.45ms
+step:2050/20000 train_loss:2.0689 train_time:89013ms step_avg:43.42ms
+step:2100/20000 train_loss:2.3382 train_time:91276ms step_avg:43.46ms
+step:2150/20000 train_loss:2.1161 train_time:93418ms step_avg:43.45ms
+step:2200/20000 train_loss:2.2380 train_time:95565ms step_avg:43.44ms
+step:2200/20000 val_loss:2.2251 val_bpb:1.3144 train_time:95590ms step_avg:43.45ms
+step:2250/20000 train_loss:2.2362 train_time:97711ms step_avg:43.43ms
+step:2300/20000 train_loss:2.2390 train_time:99973ms step_avg:43.47ms
+step:2350/20000 train_loss:2.1494 train_time:102118ms step_avg:43.45ms
+step:2400/20000 train_loss:2.1004 train_time:104264ms step_avg:43.44ms
+step:2400/20000 val_loss:2.2158 val_bpb:1.3089 train_time:104288ms step_avg:43.45ms
+step:2450/20000 train_loss:2.2078 train_time:106409ms step_avg:43.43ms
+step:2500/20000 train_loss:2.2990 train_time:108679ms step_avg:43.47ms
+step:2550/20000 train_loss:2.3510 train_time:110825ms step_avg:43.46ms
+step:2600/20000 train_loss:2.1989 train_time:112969ms step_avg:43.45ms
+step:2600/20000 val_loss:2.2097 val_bpb:1.3053 train_time:112994ms step_avg:43.46ms
+step:2650/20000 train_loss:2.0953 train_time:115115ms step_avg:43.44ms
+step:2700/20000 train_loss:2.2119 train_time:117382ms step_avg:43.47ms
+step:2750/20000 train_loss:2.2833 train_time:119524ms step_avg:43.46ms
+step:2800/20000 train_loss:2.2056 train_time:121673ms step_avg:43.45ms
+step:2800/20000 val_loss:2.2011 val_bpb:1.3002 train_time:121697ms step_avg:43.46ms
+step:2850/20000 train_loss:2.1613 train_time:123815ms step_avg:43.44ms
+step:2900/20000 train_loss:2.2400 train_time:126078ms step_avg:43.48ms
+step:2950/20000 train_loss:2.2531 train_time:128222ms step_avg:43.47ms
+step:3000/20000 train_loss:2.1098 train_time:130368ms step_avg:43.46ms
+step:3000/20000 val_loss:2.1953 val_bpb:1.2968 train_time:130392ms step_avg:43.46ms
+step:3050/20000 train_loss:2.4246 train_time:132514ms step_avg:43.45ms
+step:3100/20000 train_loss:2.1884 train_time:134780ms step_avg:43.48ms
+step:3150/20000 train_loss:2.2749 train_time:136926ms step_avg:43.47ms
+step:3200/20000 train_loss:2.1492 train_time:139071ms step_avg:43.46ms
+step:3200/20000 val_loss:2.1881 val_bpb:1.2925 train_time:139096ms step_avg:43.47ms
+step:3250/20000 train_loss:2.1286 train_time:141341ms step_avg:43.49ms
+step:3300/20000 train_loss:2.1058 train_time:143485ms step_avg:43.48ms
+step:3350/20000 train_loss:2.2214 train_time:145628ms step_avg:43.47ms
+step:3400/20000 train_loss:2.2454 train_time:147773ms step_avg:43.46ms
+step:3400/20000 val_loss:2.1854 val_bpb:1.2909 train_time:147798ms step_avg:43.47ms
+step:3450/20000 train_loss:2.2601 train_time:150039ms step_avg:43.49ms
+step:3500/20000 train_loss:2.1183 train_time:152184ms step_avg:43.48ms
+step:3550/20000 train_loss:2.0846 train_time:154329ms step_avg:43.47ms
+step:3600/20000 train_loss:2.2507 train_time:156472ms step_avg:43.46ms
+step:3600/20000 val_loss:2.1784 val_bpb:1.2868 train_time:156496ms step_avg:43.47ms
+step:3650/20000 train_loss:2.1383 train_time:158738ms step_avg:43.49ms
+step:3700/20000 train_loss:2.2848 train_time:160882ms step_avg:43.48ms
+step:3750/20000 train_loss:2.1982 train_time:163029ms step_avg:43.47ms
+step:3800/20000 train_loss:2.1399 train_time:165176ms step_avg:43.47ms
+step:3800/20000 val_loss:2.1767 val_bpb:1.2858 train_time:165200ms step_avg:43.47ms
+step:3850/20000 train_loss:2.3361 train_time:167438ms step_avg:43.49ms
+step:3900/20000 train_loss:2.2756 train_time:169582ms step_avg:43.48ms
+step:3950/20000 train_loss:2.1261 train_time:171729ms step_avg:43.48ms
+step:4000/20000 train_loss:2.1437 train_time:173878ms step_avg:43.47ms
+step:4000/20000 val_loss:2.1718 val_bpb:1.2829 train_time:173903ms step_avg:43.48ms
+step:4050/20000 train_loss:2.1718 train_time:176147ms step_avg:43.49ms
+step:4100/20000 train_loss:2.1899 train_time:178291ms step_avg:43.49ms
+step:4150/20000 train_loss:2.1285 train_time:180438ms step_avg:43.48ms
+step:4200/20000 train_loss:2.0498 train_time:182707ms step_avg:43.50ms
+step:4200/20000 val_loss:2.1666 val_bpb:1.2798 train_time:182731ms step_avg:43.51ms
+step:4250/20000 train_loss:2.2487 train_time:184852ms step_avg:43.49ms
+step:4300/20000 train_loss:2.1979 train_time:186996ms step_avg:43.49ms
+step:4350/20000 train_loss:2.1314 train_time:189141ms step_avg:43.48ms
+step:4400/20000 train_loss:2.1727 train_time:191402ms step_avg:43.50ms
+step:4400/20000 val_loss:2.1625 val_bpb:1.2774 train_time:191427ms step_avg:43.51ms
+step:4450/20000 train_loss:2.1882 train_time:193549ms step_avg:43.49ms
+step:4500/20000 train_loss:2.0735 train_time:195696ms step_avg:43.49ms
+step:4550/20000 train_loss:2.1347 train_time:197840ms step_avg:43.48ms
+step:4600/20000 train_loss:2.1710 train_time:200091ms step_avg:43.50ms
+step:4600/20000 val_loss:2.1597 val_bpb:1.2757 train_time:200114ms step_avg:43.50ms
+step:4650/20000 train_loss:2.2563 train_time:202236ms step_avg:43.49ms
+step:4700/20000 train_loss:2.2077 train_time:204381ms step_avg:43.49ms
+step:4750/20000 train_loss:2.1328 train_time:206643ms step_avg:43.50ms
+step:4800/20000 train_loss:2.1473 train_time:208788ms step_avg:43.50ms
+step:4800/20000 val_loss:2.1579 val_bpb:1.2747 train_time:208812ms step_avg:43.50ms
+step:4850/20000 train_loss:2.2067 train_time:210933ms step_avg:43.49ms
+step:4900/20000 train_loss:2.1119 train_time:213078ms step_avg:43.49ms
+step:4950/20000 train_loss:2.0031 train_time:215339ms step_avg:43.50ms
+step:5000/20000 train_loss:2.1104 train_time:217483ms step_avg:43.50ms
+step:5000/20000 val_loss:2.1532 val_bpb:1.2719 train_time:217508ms step_avg:43.50ms
+step:5050/20000 train_loss:2.0232 train_time:219627ms step_avg:43.49ms
+step:5100/20000 train_loss:2.1995 train_time:221774ms step_avg:43.49ms
+step:5150/20000 train_loss:2.0709 train_time:224038ms step_avg:43.50ms
+step:5200/20000 train_loss:2.0972 train_time:226182ms step_avg:43.50ms
+step:5200/20000 val_loss:2.1501 val_bpb:1.2701 train_time:226207ms step_avg:43.50ms
+step:5250/20000 train_loss:2.1395 train_time:228330ms step_avg:43.49ms
+step:5300/20000 train_loss:2.0947 train_time:230476ms step_avg:43.49ms
+step:5350/20000 train_loss:2.0819 train_time:232740ms step_avg:43.50ms
+step:5400/20000 train_loss:2.2099 train_time:234884ms step_avg:43.50ms
+step:5400/20000 val_loss:2.1475 val_bpb:1.2685 train_time:234909ms step_avg:43.50ms
+step:5450/20000 train_loss:2.1314 train_time:237031ms step_avg:43.49ms
+step:5500/20000 train_loss:2.2057 train_time:239295ms step_avg:43.51ms
+step:5550/20000 train_loss:2.0856 train_time:241437ms step_avg:43.50ms
+step:5600/20000 train_loss:2.1448 train_time:243583ms step_avg:43.50ms
+step:5600/20000 val_loss:2.1455 val_bpb:1.2674 train_time:243608ms step_avg:43.50ms
+step:5650/20000 train_loss:2.0312 train_time:245730ms step_avg:43.49ms
+step:5700/20000 train_loss:2.1392 train_time:247996ms step_avg:43.51ms
+step:5750/20000 train_loss:2.0206 train_time:250140ms step_avg:43.50ms
+step:5800/20000 train_loss:2.2107 train_time:252283ms step_avg:43.50ms
+step:5800/20000 val_loss:2.1439 val_bpb:1.2664 train_time:252308ms step_avg:43.50ms
+step:5850/20000 train_loss:2.0973 train_time:254429ms step_avg:43.49ms
+step:5900/20000 train_loss:2.1270 train_time:256697ms step_avg:43.51ms
+step:5950/20000 train_loss:2.0899 train_time:258840ms step_avg:43.50ms
+step:6000/20000 train_loss:2.2182 train_time:260985ms step_avg:43.50ms
+step:6000/20000 val_loss:2.1445 val_bpb:1.2668 train_time:261009ms step_avg:43.50ms
+step:6050/20000 train_loss:2.1230 train_time:263130ms step_avg:43.49ms
+step:6100/20000 train_loss:2.1640 train_time:265401ms step_avg:43.51ms
+step:6150/20000 train_loss:2.1960 train_time:267547ms step_avg:43.50ms
+step:6200/20000 train_loss:2.1217 train_time:269692ms step_avg:43.50ms
+step:6200/20000 val_loss:2.1416 val_bpb:1.2651 train_time:269717ms step_avg:43.50ms
+step:6250/20000 train_loss:2.1106 train_time:271837ms step_avg:43.49ms
+step:6300/20000 train_loss:2.1989 train_time:274105ms step_avg:43.51ms
+step:6350/20000 train_loss:2.1738 train_time:276249ms step_avg:43.50ms
+step:6400/20000 train_loss:2.1333 train_time:278396ms step_avg:43.50ms
+step:6400/20000 val_loss:2.1377 val_bpb:1.2628 train_time:278421ms step_avg:43.50ms
+step:6450/20000 train_loss:1.9696 train_time:280544ms step_avg:43.50ms
+step:6500/20000 train_loss:2.1279 train_time:282815ms step_avg:43.51ms
+step:6550/20000 train_loss:2.2768 train_time:284958ms step_avg:43.51ms
+step:6600/20000 train_loss:2.1060 train_time:287102ms step_avg:43.50ms
+step:6600/20000 val_loss:2.1354 val_bpb:1.2614 train_time:287126ms step_avg:43.50ms
+step:6650/20000 train_loss:2.1036 train_time:289368ms step_avg:43.51ms
+step:6700/20000 train_loss:2.1438 train_time:291511ms step_avg:43.51ms
+step:6750/20000 train_loss:1.8938 train_time:293654ms step_avg:43.50ms
+step:6800/20000 train_loss:2.1809 train_time:295799ms step_avg:43.50ms
+step:6800/20000 val_loss:2.1342 val_bpb:1.2607 train_time:295824ms step_avg:43.50ms
+step:6850/20000 train_loss:2.0978 train_time:298068ms step_avg:43.51ms
+step:6900/20000 train_loss:2.1146 train_time:300210ms step_avg:43.51ms
+step:6950/20000 train_loss:2.1328 train_time:302354ms step_avg:43.50ms
+step:7000/20000 train_loss:2.1537 train_time:304499ms step_avg:43.50ms
+step:7000/20000 val_loss:2.1326 val_bpb:1.2598 train_time:304523ms step_avg:43.50ms
+step:7050/20000 train_loss:2.1382 train_time:306765ms step_avg:43.51ms
+step:7100/20000 train_loss:2.1078 train_time:308911ms step_avg:43.51ms
+step:7150/20000 train_loss:2.1952 train_time:311056ms step_avg:43.50ms
+step:7200/20000 train_loss:2.1143 train_time:313204ms step_avg:43.50ms
+step:7200/20000 val_loss:2.1299 val_bpb:1.2582 train_time:313228ms step_avg:43.50ms
+step:7250/20000 train_loss:2.1009 train_time:315469ms step_avg:43.51ms
+step:7300/20000 train_loss:2.1529 train_time:317612ms step_avg:43.51ms
+step:7350/20000 train_loss:2.1532 train_time:319759ms step_avg:43.50ms
+step:7400/20000 train_loss:2.1137 train_time:321901ms step_avg:43.50ms
+step:7400/20000 val_loss:2.1282 val_bpb:1.2572 train_time:321927ms step_avg:43.50ms
+step:7450/20000 train_loss:2.4067 train_time:324167ms step_avg:43.51ms
+step:7500/20000 train_loss:2.0751 train_time:326311ms step_avg:43.51ms
+step:7550/20000 train_loss:2.1258 train_time:328457ms step_avg:43.50ms
+step:7600/20000 train_loss:2.1723 train_time:330730ms step_avg:43.52ms
+step:7600/20000 val_loss:2.1289 val_bpb:1.2576 train_time:330754ms step_avg:43.52ms
+step:7650/20000 train_loss:2.2193 train_time:332878ms step_avg:43.51ms
+step:7700/20000 train_loss:2.1329 train_time:335023ms step_avg:43.51ms
+step:7750/20000 train_loss:2.0562 train_time:337169ms step_avg:43.51ms
+step:7800/20000 train_loss:2.1669 train_time:339436ms step_avg:43.52ms
+step:7800/20000 val_loss:2.1252 val_bpb:1.2554 train_time:339460ms step_avg:43.52ms
+step:7850/20000 train_loss:2.0994 train_time:341583ms step_avg:43.51ms
+step:7900/20000 train_loss:2.1585 train_time:343729ms step_avg:43.51ms
+step:7950/20000 train_loss:2.1319 train_time:345873ms step_avg:43.51ms
+step:8000/20000 train_loss:2.2613 train_time:348141ms step_avg:43.52ms
+step:8000/20000 val_loss:2.1232 val_bpb:1.2542 train_time:348165ms step_avg:43.52ms
+step:8050/20000 train_loss:2.1775 train_time:350287ms step_avg:43.51ms
+step:8100/20000 train_loss:1.9587 train_time:352431ms step_avg:43.51ms
+step:8150/20000 train_loss:2.0401 train_time:354575ms step_avg:43.51ms
+step:8200/20000 train_loss:2.1076 train_time:356845ms step_avg:43.52ms
+step:8200/20000 val_loss:2.1228 val_bpb:1.2540 train_time:356869ms step_avg:43.52ms
+step:8250/20000 train_loss:2.0951 train_time:358988ms step_avg:43.51ms
+step:8300/20000 train_loss:2.2244 train_time:361133ms step_avg:43.51ms
+step:8350/20000 train_loss:2.0681 train_time:363279ms step_avg:43.51ms
+step:8400/20000 train_loss:2.1494 train_time:365552ms step_avg:43.52ms
+step:8400/20000 val_loss:2.1201 val_bpb:1.2524 train_time:365577ms step_avg:43.52ms
+step:8450/20000 train_loss:2.1278 train_time:367698ms step_avg:43.51ms
+step:8500/20000 train_loss:2.0289 train_time:369845ms step_avg:43.51ms
+step:8550/20000 train_loss:2.0465 train_time:372114ms step_avg:43.52ms
+step:8600/20000 train_loss:2.0682 train_time:374259ms step_avg:43.52ms
+step:8600/20000 val_loss:2.1206 val_bpb:1.2526 train_time:374282ms step_avg:43.52ms
+step:8650/20000 train_loss:2.2717 train_time:376403ms step_avg:43.51ms
+step:8700/20000 train_loss:2.1795 train_time:378549ms step_avg:43.51ms
+step:8750/20000 train_loss:2.0492 train_time:380817ms step_avg:43.52ms
+step:8800/20000 train_loss:2.1100 train_time:382964ms step_avg:43.52ms
+step:8800/20000 val_loss:2.1192 val_bpb:1.2518 train_time:382989ms step_avg:43.52ms
+step:8850/20000 train_loss:2.4323 train_time:385110ms step_avg:43.52ms
+step:8900/20000 train_loss:2.1016 train_time:387258ms step_avg:43.51ms
+step:8950/20000 train_loss:2.0290 train_time:389530ms step_avg:43.52ms
+step:9000/20000 train_loss:2.1119 train_time:391675ms step_avg:43.52ms
+step:9000/20000 val_loss:2.1204 val_bpb:1.2525 train_time:391698ms step_avg:43.52ms
+step:9050/20000 train_loss:2.0826 train_time:393819ms step_avg:43.52ms
+step:9100/20000 train_loss:2.0427 train_time:395963ms step_avg:43.51ms
+step:9150/20000 train_loss:2.1201 train_time:398238ms step_avg:43.52ms
+step:9200/20000 train_loss:2.1490 train_time:400385ms step_avg:43.52ms
+step:9200/20000 val_loss:2.1170 val_bpb:1.2505 train_time:400409ms step_avg:43.52ms
+step:9250/20000 train_loss:2.1221 train_time:402534ms step_avg:43.52ms
+step:9300/20000 train_loss:2.4550 train_time:404680ms step_avg:43.51ms
+step:9350/20000 train_loss:2.0384 train_time:406932ms step_avg:43.52ms
+step:9400/20000 train_loss:2.0736 train_time:409077ms step_avg:43.52ms
+step:9400/20000 val_loss:2.1139 val_bpb:1.2487 train_time:409102ms step_avg:43.52ms
+step:9450/20000 train_loss:2.1096 train_time:411223ms step_avg:43.52ms
+step:9500/20000 train_loss:2.1070 train_time:413493ms step_avg:43.53ms
+step:9550/20000 train_loss:2.0249 train_time:415641ms step_avg:43.52ms
+step:9600/20000 train_loss:2.1141 train_time:417785ms step_avg:43.52ms
+step:9600/20000 val_loss:2.1138 val_bpb:1.2486 train_time:417809ms step_avg:43.52ms
+step:9650/20000 train_loss:2.0183 train_time:419932ms step_avg:43.52ms
+step:9700/20000 train_loss:2.1482 train_time:422212ms step_avg:43.53ms
+step:9750/20000 train_loss:2.1811 train_time:424359ms step_avg:43.52ms
+step:9800/20000 train_loss:2.1011 train_time:426503ms step_avg:43.52ms
+step:9800/20000 val_loss:2.1143 val_bpb:1.2489 train_time:426528ms step_avg:43.52ms
+step:9850/20000 train_loss:2.1134 train_time:428771ms step_avg:43.53ms
+step:9900/20000 train_loss:2.0497 train_time:430915ms step_avg:43.53ms
+step:9950/20000 train_loss:2.1989 train_time:433061ms step_avg:43.52ms
+step:10000/20000 train_loss:2.1982 train_time:435207ms step_avg:43.52ms
+step:10000/20000 val_loss:2.1122 val_bpb:1.2477 train_time:435232ms step_avg:43.52ms
+step:10050/20000 train_loss:2.0940 train_time:437485ms step_avg:43.53ms
+step:10100/20000 train_loss:2.1277 train_time:439630ms step_avg:43.53ms
+step:10150/20000 train_loss:2.0896 train_time:441773ms step_avg:43.52ms
+step:10200/20000 train_loss:2.0642 train_time:443918ms step_avg:43.52ms
+step:10200/20000 val_loss:2.1112 val_bpb:1.2471 train_time:443941ms step_avg:43.52ms
+step:10250/20000 train_loss:2.0627 train_time:446192ms step_avg:43.53ms
+step:10300/20000 train_loss:2.2191 train_time:448339ms step_avg:43.53ms
+step:10350/20000 train_loss:2.1354 train_time:450485ms step_avg:43.53ms
+step:10400/20000 train_loss:2.0705 train_time:452630ms step_avg:43.52ms
+step:10400/20000 val_loss:2.1098 val_bpb:1.2463 train_time:452654ms step_avg:43.52ms
+step:10450/20000 train_loss:2.0663 train_time:454900ms step_avg:43.53ms
+step:10500/20000 train_loss:2.1334 train_time:457046ms step_avg:43.53ms
+step:10550/20000 train_loss:2.1931 train_time:459192ms step_avg:43.53ms
+step:10600/20000 train_loss:2.0978 train_time:461337ms step_avg:43.52ms
+step:10600/20000 val_loss:2.1081 val_bpb:1.2453 train_time:461361ms step_avg:43.52ms
+step:10650/20000 train_loss:2.0676 train_time:463610ms step_avg:43.53ms
+step:10700/20000 train_loss:2.2333 train_time:465754ms step_avg:43.53ms
+step:10750/20000 train_loss:2.1661 train_time:467899ms step_avg:43.53ms
+step:10800/20000 train_loss:2.0966 train_time:470044ms step_avg:43.52ms
+step:10800/20000 val_loss:2.1081 val_bpb:1.2453 train_time:470069ms step_avg:43.52ms
+step:10850/20000 train_loss:2.0708 train_time:472323ms step_avg:43.53ms
+step:10900/20000 train_loss:2.1666 train_time:474468ms step_avg:43.53ms
+step:10950/20000 train_loss:2.1079 train_time:476615ms step_avg:43.53ms
+step:11000/20000 train_loss:2.0774 train_time:478893ms step_avg:43.54ms
+step:11000/20000 val_loss:2.1069 val_bpb:1.2446 train_time:478917ms step_avg:43.54ms
+step:11050/20000 train_loss:2.1288 train_time:481038ms step_avg:43.53ms
+step:11100/20000 train_loss:2.0801 train_time:483185ms step_avg:43.53ms
+step:11150/20000 train_loss:1.8743 train_time:485331ms step_avg:43.53ms
+step:11200/20000 train_loss:2.1471 train_time:487603ms step_avg:43.54ms
+step:11200/20000 val_loss:2.1080 val_bpb:1.2452 train_time:487627ms step_avg:43.54ms
+step:11250/20000 train_loss:2.2046 train_time:489748ms step_avg:43.53ms
+step:11300/20000 train_loss:2.0957 train_time:491892ms step_avg:43.53ms
+step:11350/20000 train_loss:2.0963 train_time:494038ms step_avg:43.53ms
+step:11400/20000 train_loss:2.3223 train_time:496318ms step_avg:43.54ms
+step:11400/20000 val_loss:2.1051 val_bpb:1.2435 train_time:496342ms step_avg:43.54ms
+step:11450/20000 train_loss:2.0724 train_time:498464ms step_avg:43.53ms
+step:11500/20000 train_loss:2.1197 train_time:500609ms step_avg:43.53ms
+step:11550/20000 train_loss:2.0975 train_time:502754ms step_avg:43.53ms
+step:11600/20000 train_loss:2.1091 train_time:505029ms step_avg:43.54ms
+step:11600/20000 val_loss:2.1054 val_bpb:1.2437 train_time:505053ms step_avg:43.54ms
+step:11650/20000 train_loss:2.1235 train_time:507175ms step_avg:43.53ms
+step:11700/20000 train_loss:2.0795 train_time:509324ms step_avg:43.53ms
+step:11750/20000 train_loss:2.0662 train_time:511469ms step_avg:43.53ms
+step:11800/20000 train_loss:2.0765 train_time:513742ms step_avg:43.54ms
+step:11800/20000 val_loss:2.1048 val_bpb:1.2433 train_time:513766ms step_avg:43.54ms
+step:11850/20000 train_loss:2.1202 train_time:515888ms step_avg:43.53ms
+step:11900/20000 train_loss:2.1029 train_time:518033ms step_avg:43.53ms
+step:11950/20000 train_loss:2.1512 train_time:520308ms step_avg:43.54ms
+step:12000/20000 train_loss:2.1814 train_time:522453ms step_avg:43.54ms
+step:12000/20000 val_loss:2.1029 val_bpb:1.2422 train_time:522477ms step_avg:43.54ms
+step:12050/20000 train_loss:2.1085 train_time:524601ms step_avg:43.54ms
+step:12100/20000 train_loss:2.0347 train_time:526747ms step_avg:43.53ms
+step:12150/20000 train_loss:2.0601 train_time:529018ms step_avg:43.54ms
+step:12200/20000 train_loss:2.0387 train_time:531162ms step_avg:43.54ms
+step:12200/20000 val_loss:2.1021 val_bpb:1.2418 train_time:531186ms step_avg:43.54ms
+step:12250/20000 train_loss:2.0381 train_time:533312ms step_avg:43.54ms
+step:12300/20000 train_loss:2.1302 train_time:535458ms step_avg:43.53ms
+step:12350/20000 train_loss:2.1272 train_time:537727ms step_avg:43.54ms
+step:12400/20000 train_loss:2.1828 train_time:539873ms step_avg:43.54ms
+step:12400/20000 val_loss:2.1001 val_bpb:1.2406 train_time:539897ms step_avg:43.54ms
+step:12450/20000 train_loss:2.1003 train_time:542019ms step_avg:43.54ms
+step:12500/20000 train_loss:2.0696 train_time:544164ms step_avg:43.53ms
+step:12550/20000 train_loss:2.1302 train_time:546436ms step_avg:43.54ms
+step:12600/20000 train_loss:2.0527 train_time:548582ms step_avg:43.54ms
+step:12600/20000 val_loss:2.0998 val_bpb:1.2404 train_time:548606ms step_avg:43.54ms
+step:12650/20000 train_loss:2.1438 train_time:550728ms step_avg:43.54ms
+step:12700/20000 train_loss:2.2689 train_time:552877ms step_avg:43.53ms
+step:12750/20000 train_loss:2.1438 train_time:555147ms step_avg:43.54ms
+step:12800/20000 train_loss:2.0105 train_time:557293ms step_avg:43.54ms
+step:12800/20000 val_loss:2.0930 val_bpb:1.2364 train_time:557317ms step_avg:43.54ms
+step:12850/20000 train_loss:2.0413 train_time:559440ms step_avg:43.54ms
+step:12900/20000 train_loss:2.0630 train_time:561586ms step_avg:43.53ms
+step:12950/20000 train_loss:2.1627 train_time:563863ms step_avg:43.54ms
+step:13000/20000 train_loss:1.9579 train_time:566009ms step_avg:43.54ms
+step:13000/20000 val_loss:2.0859 val_bpb:1.2322 train_time:566032ms step_avg:43.54ms
+step:13050/20000 train_loss:2.0206 train_time:568155ms step_avg:43.54ms
+step:13100/20000 train_loss:1.9294 train_time:570432ms step_avg:43.54ms
+step:13150/20000 train_loss:2.0689 train_time:572576ms step_avg:43.54ms
+step:13200/20000 train_loss:2.0074 train_time:574722ms step_avg:43.54ms
+step:13200/20000 val_loss:2.0790 val_bpb:1.2281 train_time:574747ms step_avg:43.54ms
+step:13250/20000 train_loss:2.0596 train_time:576871ms step_avg:43.54ms
+step:13300/20000 train_loss:1.9474 train_time:579143ms step_avg:43.54ms
+step:13350/20000 train_loss:2.0459 train_time:581289ms step_avg:43.54ms
+step:13400/20000 train_loss:2.0441 train_time:583434ms step_avg:43.54ms
+step:13400/20000 val_loss:2.0718 val_bpb:1.2239 train_time:583458ms step_avg:43.54ms
+step:13450/20000 train_loss:2.1638 train_time:585582ms step_avg:43.54ms
+step:13500/20000 train_loss:2.1216 train_time:587857ms step_avg:43.54ms
+step:13550/20000 train_loss:2.1855 train_time:590003ms step_avg:43.54ms
+step:13600/20000 train_loss:2.0234 train_time:592147ms step_avg:43.54ms
+step:13600/20000 val_loss:2.0649 val_bpb:1.2197 train_time:592172ms step_avg:43.54ms
+step:13650/20000 train_loss:2.0316 train_time:594295ms step_avg:43.54ms
+step:13700/20000 train_loss:2.0323 train_time:596577ms step_avg:43.55ms
+step:13750/20000 train_loss:1.9910 train_time:598726ms step_avg:43.54ms
+step:13780/20000 val_loss:2.0606 val_bpb:1.2172 train_time:600038ms step_avg:43.54ms
+stopping_early: wallclock_cap train_time:600038ms step:13780/20000
+peak memory allocated: 10184 MiB reserved: 10200 MiB
+Serialized model: 67224983 bytes
+Code size: 47642 bytes
+Total submission size: 67272625 bytes
+Serialized model int8+zlib: 15815847 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)
+Total submission size int8+zlib: 15863489 bytes
+final_int8_zlib_roundtrip val_loss:2.0727 val_bpb:1.2244 eval_time:1401ms
+final_int8_zlib_roundtrip_exact val_loss:2.07269931 val_bpb:1.22436570
diff --git a/records/track_10min_16mb/2026-03-17_NaiveBaseline/train_gpt.py b/records/track_10min_16mb/2026-03-17_NaiveBaseline/train_gpt.py
new file mode 100644
index 0000000..0deb056
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-17_NaiveBaseline/train_gpt.py
@@ -0,0 +1,1126 @@
+"""
+The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
+
+Hard stop: `train_gpt.py` and `train_gpt_mlx.py` must never be longer than 1500 lines.
+"""
+
+from __future__ import annotations
+
+import copy
+import glob
+import io
+import math
+import os
+import random
+import subprocess
+import sys
+import time
+import uuid
+import zlib
+from pathlib import Path
+
+import numpy as np
+import sentencepiece as spm
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# -----------------------------
+# HYPERPARAMETERS
+# -----------------------------
+# Default Simple Baseline run:
+# - 9 transformer blocks at width 512
+# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
+# - vocab size 1024, sequence length 1024, tied embeddings
+# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+
+class Hyperparameters:
+    # Data paths are shard globs produced by the existing preprocessing pipeline.
+    data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+    train_files = os.path.join(data_path, "fineweb_train_*.bin")
+    val_files = os.path.join(data_path, "fineweb_val_*.bin")
+    tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+    run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
+    seed = int(os.environ.get("SEED", 1337))
+
+    # Validation cadence and batch size. Validation always uses the full fineweb_val split.
+    val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288))
+    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 1000))
+    train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 200))
+
+    # Training length.
+    iterations = int(os.environ.get("ITERATIONS", 20000))
+    warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 1200))
+    warmup_steps = int(os.environ.get("WARMUP_STEPS", 20))
+    train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 524_288))
+    train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 1024))
+    max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0))
+    qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))
+
+    # Model shape.
+    vocab_size = int(os.environ.get("VOCAB_SIZE", 1024))
+    num_layers = int(os.environ.get("NUM_LAYERS", 9))
+    num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4))
+    model_dim = int(os.environ.get("MODEL_DIM", 512))
+    num_heads = int(os.environ.get("NUM_HEADS", 8))
+    mlp_mult = int(os.environ.get("MLP_MULT", 2))
+    tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1")))
+    rope_base = float(os.environ.get("ROPE_BASE", 10000.0))
+    logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0))
+
+    # Optimizer hyperparameters.
+    embed_lr = float(os.environ.get("EMBED_LR", 0.6))
+    head_lr = float(os.environ.get("HEAD_LR", 0.008))
+    tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.05))
+    tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005))
+    matrix_lr = float(os.environ.get("MATRIX_LR", 0.04))
+    scalar_lr = float(os.environ.get("SCALAR_LR", 0.04))
+    muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95))
+    muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5))
+    muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
+    muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))
+    beta1 = float(os.environ.get("BETA1", 0.9))
+    beta2 = float(os.environ.get("BETA2", 0.95))
+    adam_eps = float(os.environ.get("ADAM_EPS", 1e-8))
+    grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.0))
+
+# -----------------------------
+# MUON OPTIMIZER 
+# -----------------------------
+# 
+# As borrowed from modded-nanogpt
+# Background on Muon: https://kellerjordan.github.io/posts/muon/
+
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+    # Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration.
+    # Muon uses this to normalize matrix-shaped gradients before applying them.
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    X = G.bfloat16()
+    X /= X.norm() + eps
+    transposed = G.size(0) > G.size(1)
+    if transposed:
+        X = X.T
+    for _ in range(steps):
+        A = X @ X.T
+        B = b * A + c * A @ A
+        X = a * X + B @ X
+    return X.T if transposed else X
+
+
+class Muon(torch.optim.Optimizer):
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True):
+        super().__init__(
+            params,
+            dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov),
+        )
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        distributed = dist.is_available() and dist.is_initialized()
+        world_size = dist.get_world_size() if distributed else 1
+        rank = dist.get_rank() if distributed else 0
+
+        for group in self.param_groups:
+            params = group["params"]
+            if not params:
+                continue
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+
+            total_params = sum(int(p.numel()) for p in params)
+            updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+
+            curr = 0
+            for i, p in enumerate(params):
+                if i % world_size == rank and p.grad is not None:
+                    g = p.grad
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                    buf.mul_(momentum).add_(g)
+                    if nesterov:
+                        g = g.add(buf, alpha=momentum)
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+                    # Scale correction from Muon reference implementations.
+                    g *= max(1, g.size(0) / g.size(1)) ** 0.5
+                    updates_flat[curr : curr + p.numel()] = g.reshape(-1)
+                curr += p.numel()
+
+            if distributed:
+                dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+
+            curr = 0
+            for p in params:
+                g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+                p.add_(g, alpha=-lr)
+                curr += p.numel()
+
+        return loss
+
+
+# -----------------------------
+# TOKENIZER-AGNOSTIC EVALUATION SETUP 
+# -----------------------------
+#
+# It's common for small models have a large fraction of their parameters be embeddings, since the 2 * d_model * d_vocab vectors can be gigantic.
+# Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set.
+# We calculate BPB (bits-per-byte) instead of validation loss, so we need methods to count the number of bits per token in the tokenizer.
+# Note: Submissions that edit the tokenizer will be examined more carefully, since screwing this up might unjustly improve your score.
+
+def build_sentencepiece_luts(
+    sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device
+) -> tuple[Tensor, Tensor, Tensor]:
+    sp_vocab_size = int(sp.vocab_size())
+    table_size = max(sp_vocab_size, vocab_size)
+    base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+    has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+    is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+    for token_id in range(sp_vocab_size):
+        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+            continue
+        is_boundary_token_np[token_id] = False
+        if sp.is_byte(token_id):
+            base_bytes_np[token_id] = 1
+            continue
+        piece = sp.id_to_piece(token_id)
+        if piece.startswith("▁"):
+            has_leading_space_np[token_id] = True
+            piece = piece[1:]
+        base_bytes_np[token_id] = len(piece.encode("utf-8"))
+    return (
+        torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+        torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+        torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+    )
+
+
+def load_validation_tokens(pattern: str, seq_len: int) -> Tensor:
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*.
+    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
+    usable = ((tokens.numel() - 1) // seq_len) * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def eval_val(
+    args: Hyperparameters,
+    model: nn.Module,
+    rank: int,
+    world_size: int,
+    device: torch.device,
+    grad_accum_steps: int,
+    val_tokens: Tensor,
+    base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor,
+    is_boundary_token_lut: Tensor,
+) -> tuple[float, float]:
+    # Validation computes two metrics:
+    # - val_loss: token cross-entropy (natural log)
+    # - val_bpb: tokenizer-agnostic compression metric used by the challenge
+    local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+    if local_batch_tokens < args.train_seq_len:
+        raise ValueError(
+            "VAL_BATCH_SIZE must provide at least one sequence per rank; "
+            f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, "
+            f"GRAD_ACCUM_STEPS={grad_accum_steps}, TRAIN_SEQ_LEN={args.train_seq_len}"
+        )
+    local_batch_seqs = local_batch_tokens // args.train_seq_len
+    total_seqs = (val_tokens.numel() - 1) // args.train_seq_len
+    seq_start = (total_seqs * rank) // world_size
+    seq_end = (total_seqs * (rank + 1)) // world_size
+    val_loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    val_token_count = torch.zeros((), device=device, dtype=torch.float64)
+    val_byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    model.eval()
+    with torch.inference_mode():
+        for batch_seq_start in range(seq_start, seq_end, local_batch_seqs):
+            batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end)
+            raw_start = batch_seq_start * args.train_seq_len
+            raw_end = batch_seq_end * args.train_seq_len + 1
+            local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True)
+            x = local[:-1].reshape(-1, args.train_seq_len)
+            y = local[1:].reshape(-1, args.train_seq_len)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                batch_loss = model(x, y).detach()
+            batch_token_count = float(y.numel())
+            val_loss_sum += batch_loss.to(torch.float64) * batch_token_count
+            val_token_count += batch_token_count
+            prev_ids = x.reshape(-1)
+            tgt_ids = y.reshape(-1)
+            token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)
+            token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)
+            val_byte_count += token_bytes.to(torch.float64).sum()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM)
+
+    val_loss = val_loss_sum / val_token_count
+    bits_per_token = val_loss.item() / math.log(2.0)
+    tokens_per_byte = val_token_count.item() / val_byte_count.item()
+    model.train()
+    return float(val_loss.item()), float(bits_per_token * tokens_per_byte)
+
+# -----------------------------
+# POST-TRAINING QUANTIZATION
+# -----------------------------
+#
+# It's silly to export our model, which is trained in bf16 and fp32, at that same precision.
+# Instead, we get approximately the same model (with a small hit) by quantizing the model to int8 & zlib compressing.
+# We can then decompress the model and run in higher precision for evaluation, after closing in under the size limit.
+
+CONTROL_TENSOR_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "CONTROL_TENSOR_NAME_PATTERNS",
+        "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights",
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
+        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
+INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
+INT8_PER_ROW_SCALE_DTYPE = torch.float16
+INT8_CLIP_PERCENTILE = 99.99984
+INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
+
+def tensor_nbytes(t: Tensor) -> int:
+    return int(t.numel()) * int(t.element_size())
+
+def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor:
+    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
+        return t.float().contiguous()
+    if t.dtype in {torch.float32, torch.bfloat16}:
+        passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
+        return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous()
+    return t
+
+def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]:
+    t32 = t.float()
+    if t32.ndim == 2:
+        # Matrices get one scale per row, which usually tracks output-channel
+        # ranges much better than a single tensor-wide scale.
+        clip_abs = (
+            torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1)
+            if t32.numel()
+            else torch.empty((t32.shape[0],), dtype=torch.float32)
+        )
+        clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None])
+        scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0)
+        q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous()
+        return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous()
+
+    # Vectors / scalars use a simpler per-tensor scale.
+    clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0
+    scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32)
+    q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous()
+    return q, scale
+
+def quantize_state_dict_int8(state_dict: dict[str, Tensor]):
+    # Single supported clean-script export format:
+    # - per-row int8 for 2D float tensors
+    # - per-tensor int8 for other float tensors
+    # - exact passthrough for non-floats
+    # - passthrough for small float tensors, stored as fp16 to save bytes
+    quantized: dict[str, Tensor] = {}
+    scales: dict[str, Tensor] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, Tensor] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+
+    for name, tensor in state_dict.items():
+        t = tensor.detach().to("cpu").contiguous()
+        stats["param_count"] += int(t.numel())
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += tensor_nbytes(t)
+
+        if not t.is_floating_point():
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = t
+            stats["int8_payload_bytes"] += tensor_nbytes(t)
+            continue
+
+        # Small float tensors are cheap enough to keep directly. We still downcast
+        # fp32/bf16 passthrough tensors to fp16 so metadata does not dominate size.
+        if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_tensor(name, t, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += tensor_nbytes(kept)
+            continue
+
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_tensor(t)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(t.dtype).removeprefix("torch.")
+        stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s)
+
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+
+def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    qmeta = obj.get("qmeta", {})
+    passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {})
+    for name, q in obj["quantized"].items():
+        dtype = getattr(torch, obj["dtypes"][name])
+        s = obj["scales"][name]
+        if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0:
+            s = s.to(dtype=torch.float32)
+            # Broadcast the saved row scale back across trailing dimensions.
+            out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous()
+        else:
+            scale = float(s.item())
+            out[name] = (q.float() * scale).to(dtype=dtype).contiguous()
+    for name, t in obj["passthrough"].items():
+        # Restore small tensors, undoing the temporary fp16 storage cast if needed.
+        out_t = t.detach().to("cpu").contiguous()
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous()
+        out[name] = out_t
+    return out
+
+
+# -----------------------------
+# DATA LOADING 
+# -----------------------------
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    token_bytes = np.dtype("<u2").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    # SHARD HEADER INTS & SHARD_MAGIC
+    if header.size != 256 or int(header[0]) != 20240520 or int(header[1]) != 1:
+        raise ValueError(f"Unexpected shard header for {file}")
+    num_tokens = int(header[2])
+    expected_size = header_bytes + num_tokens * token_bytes
+    if file.stat().st_size != expected_size:
+        raise ValueError(f"Shard size mismatch for {file}: expected {expected_size} bytes")
+    tokens_np = np.fromfile(file, dtype="<u2", count=num_tokens, offset=header_bytes)
+    if tokens_np.size != num_tokens:
+        raise ValueError(f"Short read for {file}")
+    return torch.from_numpy(tokens_np.astype(np.uint16, copy=False))
+
+
+class TokenStream:
+    # Reads shards sequentially and wraps around forever. The training loop therefore
+    # has deterministic, simple streaming behavior with no sampling or workers.
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[0])
+        self.pos = 0
+
+    def _advance_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> Tensor:
+        chunks: list[Tensor] = []
+        remaining = n
+        while remaining > 0:
+            avail = self.tokens.numel() - self.pos
+            if avail <= 0:
+                self._advance_file()
+                continue
+            k = min(remaining, avail)
+            chunks.append(self.tokens[self.pos : self.pos + k])
+            self.pos += k
+            remaining -= k
+        return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+
+class DistributedTokenLoader:
+    # Each call consumes a contiguous chunk from the shared token stream, then slices out
+    # one disjoint span per rank. The extra "+1" token lets us build (x, y) by shifting.
+    def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+        self.rank = rank
+        self.world_size = world_size
+        self.device = device
+        self.stream = TokenStream(pattern)
+
+    def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
+        local_tokens = global_tokens // (self.world_size * grad_accum_steps)
+        per_rank_span = local_tokens + 1
+        chunk = self.stream.take(per_rank_span * self.world_size)
+        start = self.rank * per_rank_span
+        local = chunk[start : start + per_rank_span].to(dtype=torch.int64)
+        x = local[:-1].reshape(-1, seq_len)
+        y = local[1:].reshape(-1, seq_len)
+        return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
+
+# -----------------------------
+# TRANSFORMER MODULES
+# -----------------------------
+
+class RMSNorm(nn.Module):
+    def __init__(self, eps: float | None = None):
+        super().__init__()
+        self.eps = eps
+
+    def forward(self, x: Tensor) -> Tensor:
+        return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+
+class CastedLinear(nn.Linear):
+    # Keep weights in fp32 for optimizer/state quality, cast at matmul time for bf16 compute.
+    def forward(self, x: Tensor) -> Tensor:
+        bias = self.bias.to(x.dtype) if self.bias is not None else None
+        return F.linear(x, self.weight.to(x.dtype), bias)
+
+
+def restore_low_dim_params_to_fp32(module: nn.Module) -> None:
+    # Keep small/control parameters in fp32 even when the model body runs in bf16.
+    with torch.no_grad():
+        for name, param in module.named_parameters():
+            if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32:
+                param.data = param.data.float()
+
+
+class Rotary(nn.Module):
+    # Caches cos/sin tables per sequence length on the current device.
+    def __init__(self, dim: int, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached: Tensor | None = None
+        self._sin_cached: Tensor | None = None
+
+    def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]:
+        if (
+            self._cos_cached is None
+            or self._sin_cached is None
+            or self._seq_len_cached != seq_len
+            or self._cos_cached.device != device
+        ):
+            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+            freqs = torch.outer(t, self.inv_freq.to(device))
+            self._cos_cached = freqs.cos()[None, None, :, :]
+            self._sin_cached = freqs.sin()[None, None, :, :]
+            self._seq_len_cached = seq_len
+        return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
+
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+
+class CausalSelfAttention(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if dim % num_heads != 0:
+            raise ValueError("model_dim must be divisible by num_heads")
+        if num_heads % num_kv_heads != 0:
+            raise ValueError("num_heads must be divisible by num_kv_heads")
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = dim // num_heads
+        if self.head_dim % 2 != 0:
+            raise ValueError("head_dim must be even for RoPE")
+        kv_dim = self.num_kv_heads * self.head_dim
+        self.c_q = CastedLinear(dim, dim, bias=False)
+        self.c_k = CastedLinear(dim, kv_dim, bias=False)
+        self.c_v = CastedLinear(dim, kv_dim, bias=False)
+        self.proj = CastedLinear(dim, dim, bias=False)
+        self.proj._zero_init = True
+        self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+        self.rotary = Rotary(self.head_dim, base=rope_base)
+
+    def forward(self, x: Tensor) -> Tensor:
+        bsz, seqlen, dim = x.shape
+        q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
+        k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        q = F.rms_norm(q, (q.size(-1),))
+        k = F.rms_norm(k, (k.size(-1),))
+        cos, sin = self.rotary(seqlen, x.device, q.dtype)
+        q = apply_rotary_emb(q, cos, sin)
+        k = apply_rotary_emb(k, cos, sin)
+        q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None]
+        y = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=None,
+            is_causal=True,
+            enable_gqa=(self.num_kv_heads != self.num_heads),
+        )
+        y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim)
+        return self.proj(y)
+
+
+class MLP(nn.Module):
+    # relu^2 MLP from the original modded-nanogpt setup
+    def __init__(self, dim: int, mlp_mult: int):
+        super().__init__()
+        hidden = mlp_mult * dim
+        self.fc = CastedLinear(dim, hidden, bias=False)
+        self.proj = CastedLinear(hidden, dim, bias=False)
+        self.proj._zero_init = True
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = torch.relu(self.fc(x))
+        return self.proj(x.square())
+
+
+class Block(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init)
+        self.mlp = MLP(dim, mlp_mult)
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x))
+        x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out
+        x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x))
+        return x
+
+
+class GPT(nn.Module):
+    def __init__(
+        self,
+        vocab_size: int,
+        num_layers: int,
+        model_dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        tie_embeddings: bool,
+        tied_embed_init_std: float,
+        logit_softcap: float,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if logit_softcap <= 0.0:
+            raise ValueError(f"logit_softcap must be positive, got {logit_softcap}")
+        self.tie_embeddings = tie_embeddings
+        self.tied_embed_init_std = tied_embed_init_std
+        self.logit_softcap = logit_softcap
+        self.tok_emb = nn.Embedding(vocab_size, model_dim)
+        self.num_encoder_layers = num_layers // 2
+        self.num_decoder_layers = num_layers - self.num_encoder_layers
+        self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+        self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
+        self.blocks = nn.ModuleList(
+            [
+                Block(
+                    model_dim,
+                    num_heads,
+                    num_kv_heads,
+                    mlp_mult,
+                    rope_base,
+                    qk_gain_init,
+                )
+                for i in range(num_layers)
+            ]
+        )
+        self.final_norm = RMSNorm()
+        self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False)
+        if self.lm_head is not None:
+            self.lm_head._zero_init = True
+        self._init_weights()
+
+    def _init_weights(self) -> None:
+        if self.tie_embeddings:
+            nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std)
+        for module in self.modules():
+            if isinstance(module, nn.Linear) and getattr(module, "_zero_init", False):
+                nn.init.zeros_(module.weight)
+
+    def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor:
+        x = self.tok_emb(input_ids)
+        x = F.rms_norm(x, (x.size(-1),))
+        x0 = x
+        skips: list[Tensor] = []
+
+        # First half stores skips; second half reuses them in reverse order.
+        for i in range(self.num_encoder_layers):
+            x = self.blocks[i](x, x0)
+            skips.append(x)
+        for i in range(self.num_decoder_layers):
+            if skips:
+                x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop()
+            x = self.blocks[self.num_encoder_layers + i](x, x0)
+
+        x = self.final_norm(x).reshape(-1, x.size(-1))
+        targets = target_ids.reshape(-1)
+        if self.tie_embeddings:
+            logits_proj = F.linear(x, self.tok_emb.weight)
+        else:
+            if self.lm_head is None:
+                raise RuntimeError("lm_head is required when tie_embeddings=False")
+            logits_proj = self.lm_head(x)
+        logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)
+        return F.cross_entropy(logits.float(), targets, reduction="mean")
+
+
+# -----------------------------
+# TRAINING
+# -----------------------------
+
+def main() -> None:
+    global zeropower_via_newtonschulz5
+
+    code = Path(__file__).read_text(encoding="utf-8")
+    args = Hyperparameters()
+    zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5)
+
+    # -----------------------------
+    # DISTRIBUTED + CUDA SETUP
+    # -----------------------------
+
+    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+    rank = int(os.environ.get("RANK", "0"))
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    if world_size <= 0:
+        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
+    if 8 % world_size != 0:
+        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
+    grad_accum_steps = 8 // world_size
+    grad_scale = 1.0 / grad_accum_steps
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required")
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+    master_process = rank == 0
+
+    # Fast math knobs
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
+
+    enable_cudnn_sdp(False)
+    enable_flash_sdp(True)
+    enable_mem_efficient_sdp(False)
+    enable_math_sdp(False)
+
+    logfile = None
+    if master_process:
+        os.makedirs("logs", exist_ok=True)
+        logfile = f"logs/{args.run_id}.txt"
+        print(logfile)
+
+    def log0(msg: str, console: bool = True) -> None:
+        if not master_process:
+            return
+        if console:
+            print(msg)
+        if logfile is not None:
+            with open(logfile, "a", encoding="utf-8") as f:
+                print(msg, file=f)
+
+    log0(code, console=False)
+    log0("=" * 100, console=False)
+    log0(f"Running Python {sys.version}", console=False)
+    log0(f"Running PyTorch {torch.__version__}", console=False)
+    log0(
+        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout,
+        console=False,
+    )
+    log0("=" * 100, console=False)
+
+    # -----------------------------
+    # TOKENIZER + VALIDATION METRIC SETUP
+    # -----------------------------
+
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    torch.cuda.manual_seed_all(args.seed)
+
+    if not args.tokenizer_path.endswith(".model"):
+        raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}")
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    if int(sp.vocab_size()) != args.vocab_size:
+        raise ValueError(
+            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
+        )
+    dataset_dir = Path(args.data_path).resolve()
+    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
+    val_tokens = load_validation_tokens(args.val_files, args.train_seq_len)
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
+        sp, args.vocab_size, device
+    )
+    log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
+    log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}")
+    log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}")
+
+    # -----------------------------
+    # MODEL + OPTIMIZER SETUP
+    # -----------------------------
+
+    base_model = GPT(
+        vocab_size=args.vocab_size,
+        num_layers=args.num_layers,
+        model_dim=args.model_dim,
+        num_heads=args.num_heads,
+        num_kv_heads=args.num_kv_heads,
+        mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings,
+        tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap,
+        rope_base=args.rope_base,
+        qk_gain_init=args.qk_gain_init,
+    ).to(device).bfloat16()
+    for module in base_model.modules():
+        if isinstance(module, CastedLinear):
+            module.float()
+    restore_low_dim_params_to_fp32(base_model)
+    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
+    model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model
+
+    # Optimizer split:
+    # - token embedding (Adam) uses EMBED_LR
+    # - untied lm_head (Adam) uses HEAD_LR
+    # - matrix params in transformer blocks use MATRIX_LR via Muon
+    # - vectors/scalars use SCALAR_LR via Adam
+    block_named_params = list(base_model.blocks.named_parameters())
+    matrix_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    scalar_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    if base_model.skip_weights.numel() > 0:
+        scalar_params.append(base_model.skip_weights)
+    token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
+    optimizer_tok = torch.optim.Adam(
+        [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizer_muon = Muon(
+        matrix_params,
+        lr=args.matrix_lr,
+        momentum=args.muon_momentum,
+        backend_steps=args.muon_backend_steps,
+    )
+    for group in optimizer_muon.param_groups:
+        group["base_lr"] = args.matrix_lr
+    optimizer_scalar = torch.optim.Adam(
+        [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar]
+    if base_model.lm_head is not None:
+        optimizer_head = torch.optim.Adam(
+            [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}],
+            betas=(args.beta1, args.beta2),
+            eps=args.adam_eps,
+            fused=True,
+        )
+        optimizers.insert(1, optimizer_head)
+
+    n_params = sum(p.numel() for p in base_model.parameters())
+    log0(f"model_params:{n_params}")
+    log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
+    log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
+    log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
+    log0(
+        f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} "
+        f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} "
+        f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}"
+    )
+    log0(
+        f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} "
+        f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} "
+        f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}"
+    )
+    log0(f"seed:{args.seed}")
+
+    # -----------------------------
+    # DATA LOADER & MODEL WARMUP
+    # -----------------------------
+
+    train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    def zero_grad_all() -> None:
+        for opt in optimizers:
+            opt.zero_grad(set_to_none=True)
+
+    max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
+
+    def lr_mul(step: int, elapsed_ms: float) -> float:
+        if args.warmdown_iters <= 0:
+            return 1.0
+        if max_wallclock_ms is None:
+            warmdown_start = max(args.iterations - args.warmdown_iters, 0)
+            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0
+        step_ms = elapsed_ms / max(step, 1)
+        warmdown_ms = args.warmdown_iters * step_ms
+        remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
+        return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
+
+    # Warmup primes the compiled forward/backward/optimizer paths, then we restore the
+    # initial weights/optimizer state so measured training starts from the true init.
+    if args.warmup_steps > 0:
+        initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()}
+        initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers]
+        model.train()
+        for warmup_step in range(args.warmup_steps):
+            zero_grad_all()
+            for micro_step in range(grad_accum_steps):
+                if distributed:
+                    model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+                x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+                with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                    warmup_loss = model(x, y)
+                (warmup_loss * grad_scale).backward()
+            for opt in optimizers:
+                opt.step()
+            zero_grad_all()
+            if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps:
+                log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}")
+        base_model.load_state_dict(initial_model_state, strict=True)
+        for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
+            opt.load_state_dict(state)
+        zero_grad_all()
+        if distributed:
+            model.require_backward_grad_sync = True
+        train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    # -----------------------------
+    # MAIN TRAINING LOOP
+    # -----------------------------
+
+    training_time_ms = 0.0
+    stop_after_step: int | None = None
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+
+    step = 0
+    while True:
+        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+
+        should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0)
+        if should_validate:
+            torch.cuda.synchronize()
+            training_time_ms += 1000.0 * (time.perf_counter() - t0)
+            val_loss, val_bpb = eval_val(
+                args,
+                model,
+                rank,
+                world_size,
+                device,
+                grad_accum_steps,
+                val_tokens,
+                base_bytes_lut,
+                has_leading_space_lut,
+                is_boundary_token_lut,
+            )
+            log0(
+                f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+                f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms"
+            )
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+
+        if last_step:
+            if stop_after_step is not None and step < args.iterations:
+                log0(
+                    f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms "
+                    f"step:{step}/{args.iterations}"
+                )
+            break
+
+        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        scale = lr_mul(step, elapsed_ms)
+        zero_grad_all()
+        train_loss = torch.zeros((), device=device)
+        for micro_step in range(grad_accum_steps):
+            if distributed:
+                model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+            x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                loss = model(x, y)
+            train_loss += loss.detach()
+            (loss * grad_scale).backward()
+        train_loss /= grad_accum_steps
+
+        frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+        muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+        for group in optimizer_muon.param_groups:
+            group["momentum"] = muon_momentum
+
+        for opt in optimizers:
+            for group in opt.param_groups:
+                group["lr"] = group["base_lr"] * scale
+
+        if args.grad_clip_norm > 0:
+            torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm)
+        for opt in optimizers:
+            opt.step()
+        zero_grad_all()
+
+        step += 1
+        approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        should_log_train = (
+            args.train_log_every > 0
+            and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None)
+        )
+        if should_log_train:
+            log0(
+                f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} "
+                f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms"
+            )
+
+        # Needed to sync whether we've reached the wallclock cap.
+        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
+        if distributed and max_wallclock_ms is not None:
+            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
+            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
+            reached_cap = bool(reached_cap_tensor.item())
+        if stop_after_step is None and reached_cap:
+            stop_after_step = step
+
+    log0(
+        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
+        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
+
+    # -----------------------------
+    # SERIALIZATION + ROUNDTRIP VALIDATION
+    # -----------------------------
+    # Save the raw state (useful for debugging/loading in PyTorch directly), then always produce
+    # the compressed int8+zlib artifact and validate the round-tripped weights.
+
+    if master_process:
+        torch.save(base_model.state_dict(), "final_model.pt")
+        model_bytes = os.path.getsize("final_model.pt")
+        code_bytes = len(code.encode("utf-8"))
+        log0(f"Serialized model: {model_bytes} bytes")
+        log0(f"Code size: {code_bytes} bytes")
+        log0(f"Total submission size: {model_bytes + code_bytes} bytes")
+
+    quant_obj, quant_stats = quantize_state_dict_int8(base_model.state_dict())
+    quant_buf = io.BytesIO()
+    torch.save(quant_obj, quant_buf)
+    quant_raw = quant_buf.getvalue()
+    quant_blob = zlib.compress(quant_raw, level=9)
+    quant_raw_bytes = len(quant_raw)
+    if master_process:
+        with open("final_model.int8.ptz", "wb") as f:
+            f.write(quant_blob)
+        quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+        code_bytes = len(code.encode("utf-8"))
+        ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
+        log0(
+            f"Serialized model int8+zlib: {quant_file_bytes} bytes "
+            f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
+        )
+        log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+
+    if distributed:
+        dist.barrier()
+    with open("final_model.int8.ptz", "rb") as f:
+        quant_blob_disk = f.read()
+    quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu")
+    base_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True)
+    torch.cuda.synchronize()
+    t_qeval = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args,
+        model,
+        rank,
+        world_size,
+        device,
+        grad_accum_steps,
+        val_tokens,
+        base_bytes_lut,
+        has_leading_space_lut,
+        is_boundary_token_lut,
+    )
+    torch.cuda.synchronize()
+    log0(
+        f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+        f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+    )
+    log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+
+    if distributed:
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md
new file mode 100644
index 0000000..c36c90b
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md
@@ -0,0 +1,56 @@
+This record captures an unlimited-compute non-record submission built from the current root `train_gpt.py`.
+
+This run is not intended to satisfy the 10-minute cutoff for the main leaderboard. It uses the same 9x512 SP-1024 tied-embedding baseline layout, but extends training to a 4-hour wallclock cap on `pgut3` while evaluating against the full 50k-document validation split every 20k steps.
+
+Configuration:
+- Track: `non-record`, unlimited compute, still under the `16,000,000` byte artifact cap
+- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
+- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
+- Tied embedding LR: `TIED_EMBED_LR=0.05`
+- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024`
+- Validation cadence: `VAL_LOSS_EVERY=20000` on the full `fineweb_val_*` split
+
+Command (track-relevant params):
+```bash
+OMP_NUM_THREADS=1 \
+TORCH_NCCL_ASYNC_ERROR_HANDLING=1 \
+RUN_ID=train_gpt_pgut3_quasi10b_sp1024_4h_20260318_075102 \
+DATA_PATH=/tmp/fineweb_quasi10Bfrom50B_50keval_sp1024_v0/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=/tmp/fineweb_quasi10Bfrom50B_50keval_sp1024_v0/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+NUM_LAYERS=9 \
+MODEL_DIM=512 \
+NUM_HEADS=8 \
+NUM_KV_HEADS=4 \
+MLP_MULT=2 \
+TIE_EMBEDDINGS=1 \
+TIED_EMBED_LR=0.05 \
+ITERATIONS=500000 \
+WARMUP_STEPS=20 \
+MAX_WALLCLOCK_SECONDS=14400 \
+TRAIN_BATCH_TOKENS=524288 \
+TRAIN_SEQ_LEN=1024 \
+TRAIN_LOG_EVERY=200 \
+VAL_LOSS_EVERY=20000 \
+torchrun --standalone --nproc_per_node=8 /root/code/parameter-golf/train_gpt.py
+```
+
+Key metrics (from `train.log`):
+- Timed training stopped at `329430/500000` steps due to the 4-hour wallclock cap.
+- Best pre-quant eval at stop: `val_loss:1.9837`, `val_bpb:1.1749`
+- Post-quant roundtrip eval: `val_loss:2.0386`, `val_bpb:1.2074`
+- Exact printed metric: `final_int8_zlib_roundtrip_exact val_bpb:1.20737944`
+- Train time: `14400039ms` (`step_avg:43.71ms`)
+- Peak memory: `10184 MiB allocated`, `10588 MiB reserved`
+- Serialized model int8+zlib: `15762519 bytes`
+- Code size: `47642 bytes`
+- Total submission size int8+zlib: `15810161 bytes`
+
+Training volume:
+- Global batch: `524288` tokens/step
+- Total train tokens seen: `172716195840`
+
+Included files:
+- `train_gpt.py` (code snapshot used for the run)
+- `train.log` (exact remote training log)
+- `submission.json` (leaderboard metadata)
diff --git a/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/submission.json b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/submission.json
new file mode 100644
index 0000000..4192982
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/submission.json
@@ -0,0 +1,17 @@
+{
+  "author": "Will DePue",
+  "github_id": "williamd",
+  "name": "4-Hour Quasi-10B SP1024",
+  "blurb": "Unlimited compute track: SP-1024 9x512 KV4 run on pgut3 for 4 hours against the quasi10Bfrom50B 50k-eval export; pre-quant reached 1.1749 BPB at wallclock stop and final int8+zlib roundtrip scored 1.2074 under the 16,000,000-byte cap.",
+  "date": "2026-03-18T11:53:00Z",
+  "track": "non-record-unlimited-compute-16mb",
+  "val_loss": 2.03860961,
+  "val_bpb": 1.20737944,
+  "pre_quant_val_loss": 1.9837,
+  "pre_quant_val_bpb": 1.1749,
+  "step_stop": 329430,
+  "wallclock_seconds": 14400.039,
+  "bytes_total": 15810161,
+  "bytes_model_int8_zlib": 15762519,
+  "bytes_code": 47642
+}
diff --git a/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/train.log b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/train.log
new file mode 100644
index 0000000..fbc4823
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/train.log
@@ -0,0 +1,2901 @@
+"""
+The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
+
+Hard stop: `train_gpt.py` and `train_gpt_mlx.py` must never be longer than 1500 lines.
+"""
+
+from __future__ import annotations
+
+import copy
+import glob
+import io
+import math
+import os
+import random
+import subprocess
+import sys
+import time
+import uuid
+import zlib
+from pathlib import Path
+
+import numpy as np
+import sentencepiece as spm
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# -----------------------------
+# HYPERPARAMETERS
+# -----------------------------
+# Default Simple Baseline run:
+# - 9 transformer blocks at width 512
+# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
+# - vocab size 1024, sequence length 1024, tied embeddings
+# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+
+class Hyperparameters:
+    # Data paths are shard globs produced by the existing preprocessing pipeline.
+    data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+    train_files = os.path.join(data_path, "fineweb_train_*.bin")
+    val_files = os.path.join(data_path, "fineweb_val_*.bin")
+    tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+    run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
+    seed = int(os.environ.get("SEED", 1337))
+
+    # Validation cadence and batch size. Validation always uses the full fineweb_val split.
+    val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288))
+    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 1000))
+    train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 200))
+
+    # Training length.
+    iterations = int(os.environ.get("ITERATIONS", 20000))
+    warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 1200))
+    warmup_steps = int(os.environ.get("WARMUP_STEPS", 20))
+    train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 524_288))
+    train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 1024))
+    max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0))
+    qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))
+
+    # Model shape.
+    vocab_size = int(os.environ.get("VOCAB_SIZE", 1024))
+    num_layers = int(os.environ.get("NUM_LAYERS", 9))
+    num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4))
+    model_dim = int(os.environ.get("MODEL_DIM", 512))
+    num_heads = int(os.environ.get("NUM_HEADS", 8))
+    mlp_mult = int(os.environ.get("MLP_MULT", 2))
+    tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1")))
+    rope_base = float(os.environ.get("ROPE_BASE", 10000.0))
+    logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0))
+
+    # Optimizer hyperparameters.
+    embed_lr = float(os.environ.get("EMBED_LR", 0.6))
+    head_lr = float(os.environ.get("HEAD_LR", 0.008))
+    tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.05))
+    tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005))
+    matrix_lr = float(os.environ.get("MATRIX_LR", 0.04))
+    scalar_lr = float(os.environ.get("SCALAR_LR", 0.04))
+    muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95))
+    muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5))
+    muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
+    muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))
+    beta1 = float(os.environ.get("BETA1", 0.9))
+    beta2 = float(os.environ.get("BETA2", 0.95))
+    adam_eps = float(os.environ.get("ADAM_EPS", 1e-8))
+    grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.0))
+
+# -----------------------------
+# MUON OPTIMIZER 
+# -----------------------------
+# 
+# As borrowed from modded-nanogpt
+# Background on Muon: https://kellerjordan.github.io/posts/muon/
+
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+    # Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration.
+    # Muon uses this to normalize matrix-shaped gradients before applying them.
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    X = G.bfloat16()
+    X /= X.norm() + eps
+    transposed = G.size(0) > G.size(1)
+    if transposed:
+        X = X.T
+    for _ in range(steps):
+        A = X @ X.T
+        B = b * A + c * A @ A
+        X = a * X + B @ X
+    return X.T if transposed else X
+
+
+class Muon(torch.optim.Optimizer):
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True):
+        super().__init__(
+            params,
+            dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov),
+        )
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        distributed = dist.is_available() and dist.is_initialized()
+        world_size = dist.get_world_size() if distributed else 1
+        rank = dist.get_rank() if distributed else 0
+
+        for group in self.param_groups:
+            params = group["params"]
+            if not params:
+                continue
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+
+            total_params = sum(int(p.numel()) for p in params)
+            updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+
+            curr = 0
+            for i, p in enumerate(params):
+                if i % world_size == rank and p.grad is not None:
+                    g = p.grad
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                    buf.mul_(momentum).add_(g)
+                    if nesterov:
+                        g = g.add(buf, alpha=momentum)
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+                    # Scale correction from Muon reference implementations.
+                    g *= max(1, g.size(0) / g.size(1)) ** 0.5
+                    updates_flat[curr : curr + p.numel()] = g.reshape(-1)
+                curr += p.numel()
+
+            if distributed:
+                dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+
+            curr = 0
+            for p in params:
+                g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+                p.add_(g, alpha=-lr)
+                curr += p.numel()
+
+        return loss
+
+
+# -----------------------------
+# TOKENIZER-AGNOSTIC EVALUATION SETUP 
+# -----------------------------
+#
+# It's common for small models have a large fraction of their parameters be embeddings, since the 2 * d_model * d_vocab vectors can be gigantic.
+# Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set.
+# We calculate BPB (bits-per-byte) instead of validation loss, so we need methods to count the number of bits per token in the tokenizer.
+# Note: Submissions that edit the tokenizer will be examined more carefully, since screwing this up might unjustly improve your score.
+
+def build_sentencepiece_luts(
+    sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device
+) -> tuple[Tensor, Tensor, Tensor]:
+    sp_vocab_size = int(sp.vocab_size())
+    table_size = max(sp_vocab_size, vocab_size)
+    base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+    has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+    is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+    for token_id in range(sp_vocab_size):
+        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+            continue
+        is_boundary_token_np[token_id] = False
+        if sp.is_byte(token_id):
+            base_bytes_np[token_id] = 1
+            continue
+        piece = sp.id_to_piece(token_id)
+        if piece.startswith("▁"):
+            has_leading_space_np[token_id] = True
+            piece = piece[1:]
+        base_bytes_np[token_id] = len(piece.encode("utf-8"))
+    return (
+        torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+        torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+        torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+    )
+
+
+def load_validation_tokens(pattern: str, seq_len: int) -> Tensor:
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*.
+    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
+    usable = ((tokens.numel() - 1) // seq_len) * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def eval_val(
+    args: Hyperparameters,
+    model: nn.Module,
+    rank: int,
+    world_size: int,
+    device: torch.device,
+    grad_accum_steps: int,
+    val_tokens: Tensor,
+    base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor,
+    is_boundary_token_lut: Tensor,
+) -> tuple[float, float]:
+    # Validation computes two metrics:
+    # - val_loss: token cross-entropy (natural log)
+    # - val_bpb: tokenizer-agnostic compression metric used by the challenge
+    local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+    if local_batch_tokens < args.train_seq_len:
+        raise ValueError(
+            "VAL_BATCH_SIZE must provide at least one sequence per rank; "
+            f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, "
+            f"GRAD_ACCUM_STEPS={grad_accum_steps}, TRAIN_SEQ_LEN={args.train_seq_len}"
+        )
+    local_batch_seqs = local_batch_tokens // args.train_seq_len
+    total_seqs = (val_tokens.numel() - 1) // args.train_seq_len
+    seq_start = (total_seqs * rank) // world_size
+    seq_end = (total_seqs * (rank + 1)) // world_size
+    val_loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    val_token_count = torch.zeros((), device=device, dtype=torch.float64)
+    val_byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    model.eval()
+    with torch.inference_mode():
+        for batch_seq_start in range(seq_start, seq_end, local_batch_seqs):
+            batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end)
+            raw_start = batch_seq_start * args.train_seq_len
+            raw_end = batch_seq_end * args.train_seq_len + 1
+            local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True)
+            x = local[:-1].reshape(-1, args.train_seq_len)
+            y = local[1:].reshape(-1, args.train_seq_len)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                batch_loss = model(x, y).detach()
+            batch_token_count = float(y.numel())
+            val_loss_sum += batch_loss.to(torch.float64) * batch_token_count
+            val_token_count += batch_token_count
+            prev_ids = x.reshape(-1)
+            tgt_ids = y.reshape(-1)
+            token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)
+            token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)
+            val_byte_count += token_bytes.to(torch.float64).sum()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM)
+
+    val_loss = val_loss_sum / val_token_count
+    bits_per_token = val_loss.item() / math.log(2.0)
+    tokens_per_byte = val_token_count.item() / val_byte_count.item()
+    model.train()
+    return float(val_loss.item()), float(bits_per_token * tokens_per_byte)
+
+# -----------------------------
+# POST-TRAINING QUANTIZATION
+# -----------------------------
+#
+# It's silly to export our model, which is trained in bf16 and fp32, at that same precision.
+# Instead, we get approximately the same model (with a small hit) by quantizing the model to int8 & zlib compressing.
+# We can then decompress the model and run in higher precision for evaluation, after closing in under the size limit.
+
+CONTROL_TENSOR_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "CONTROL_TENSOR_NAME_PATTERNS",
+        "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights",
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
+        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
+INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
+INT8_PER_ROW_SCALE_DTYPE = torch.float16
+INT8_CLIP_PERCENTILE = 99.99984
+INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
+
+def tensor_nbytes(t: Tensor) -> int:
+    return int(t.numel()) * int(t.element_size())
+
+def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor:
+    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
+        return t.float().contiguous()
+    if t.dtype in {torch.float32, torch.bfloat16}:
+        passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
+        return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous()
+    return t
+
+def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]:
+    t32 = t.float()
+    if t32.ndim == 2:
+        # Matrices get one scale per row, which usually tracks output-channel
+        # ranges much better than a single tensor-wide scale.
+        clip_abs = (
+            torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1)
+            if t32.numel()
+            else torch.empty((t32.shape[0],), dtype=torch.float32)
+        )
+        clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None])
+        scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0)
+        q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous()
+        return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous()
+
+    # Vectors / scalars use a simpler per-tensor scale.
+    clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0
+    scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32)
+    q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous()
+    return q, scale
+
+def quantize_state_dict_int8(state_dict: dict[str, Tensor]):
+    # Single supported clean-script export format:
+    # - per-row int8 for 2D float tensors
+    # - per-tensor int8 for other float tensors
+    # - exact passthrough for non-floats
+    # - passthrough for small float tensors, stored as fp16 to save bytes
+    quantized: dict[str, Tensor] = {}
+    scales: dict[str, Tensor] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, Tensor] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+
+    for name, tensor in state_dict.items():
+        t = tensor.detach().to("cpu").contiguous()
+        stats["param_count"] += int(t.numel())
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += tensor_nbytes(t)
+
+        if not t.is_floating_point():
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = t
+            stats["int8_payload_bytes"] += tensor_nbytes(t)
+            continue
+
+        # Small float tensors are cheap enough to keep directly. We still downcast
+        # fp32/bf16 passthrough tensors to fp16 so metadata does not dominate size.
+        if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_tensor(name, t, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += tensor_nbytes(kept)
+            continue
+
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_tensor(t)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(t.dtype).removeprefix("torch.")
+        stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s)
+
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+
+def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    qmeta = obj.get("qmeta", {})
+    passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {})
+    for name, q in obj["quantized"].items():
+        dtype = getattr(torch, obj["dtypes"][name])
+        s = obj["scales"][name]
+        if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0:
+            s = s.to(dtype=torch.float32)
+            # Broadcast the saved row scale back across trailing dimensions.
+            out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous()
+        else:
+            scale = float(s.item())
+            out[name] = (q.float() * scale).to(dtype=dtype).contiguous()
+    for name, t in obj["passthrough"].items():
+        # Restore small tensors, undoing the temporary fp16 storage cast if needed.
+        out_t = t.detach().to("cpu").contiguous()
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous()
+        out[name] = out_t
+    return out
+
+
+# -----------------------------
+# DATA LOADING 
+# -----------------------------
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    token_bytes = np.dtype("<u2").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    # SHARD HEADER INTS & SHARD_MAGIC
+    if header.size != 256 or int(header[0]) != 20240520 or int(header[1]) != 1:
+        raise ValueError(f"Unexpected shard header for {file}")
+    num_tokens = int(header[2])
+    expected_size = header_bytes + num_tokens * token_bytes
+    if file.stat().st_size != expected_size:
+        raise ValueError(f"Shard size mismatch for {file}: expected {expected_size} bytes")
+    tokens_np = np.fromfile(file, dtype="<u2", count=num_tokens, offset=header_bytes)
+    if tokens_np.size != num_tokens:
+        raise ValueError(f"Short read for {file}")
+    return torch.from_numpy(tokens_np.astype(np.uint16, copy=False))
+
+
+class TokenStream:
+    # Reads shards sequentially and wraps around forever. The training loop therefore
+    # has deterministic, simple streaming behavior with no sampling or workers.
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[0])
+        self.pos = 0
+
+    def _advance_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> Tensor:
+        chunks: list[Tensor] = []
+        remaining = n
+        while remaining > 0:
+            avail = self.tokens.numel() - self.pos
+            if avail <= 0:
+                self._advance_file()
+                continue
+            k = min(remaining, avail)
+            chunks.append(self.tokens[self.pos : self.pos + k])
+            self.pos += k
+            remaining -= k
+        return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+
+class DistributedTokenLoader:
+    # Each call consumes a contiguous chunk from the shared token stream, then slices out
+    # one disjoint span per rank. The extra "+1" token lets us build (x, y) by shifting.
+    def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+        self.rank = rank
+        self.world_size = world_size
+        self.device = device
+        self.stream = TokenStream(pattern)
+
+    def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
+        local_tokens = global_tokens // (self.world_size * grad_accum_steps)
+        per_rank_span = local_tokens + 1
+        chunk = self.stream.take(per_rank_span * self.world_size)
+        start = self.rank * per_rank_span
+        local = chunk[start : start + per_rank_span].to(dtype=torch.int64)
+        x = local[:-1].reshape(-1, seq_len)
+        y = local[1:].reshape(-1, seq_len)
+        return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
+
+# -----------------------------
+# TRANSFORMER MODULES
+# -----------------------------
+
+class RMSNorm(nn.Module):
+    def __init__(self, eps: float | None = None):
+        super().__init__()
+        self.eps = eps
+
+    def forward(self, x: Tensor) -> Tensor:
+        return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+
+class CastedLinear(nn.Linear):
+    # Keep weights in fp32 for optimizer/state quality, cast at matmul time for bf16 compute.
+    def forward(self, x: Tensor) -> Tensor:
+        bias = self.bias.to(x.dtype) if self.bias is not None else None
+        return F.linear(x, self.weight.to(x.dtype), bias)
+
+
+def restore_low_dim_params_to_fp32(module: nn.Module) -> None:
+    # Keep small/control parameters in fp32 even when the model body runs in bf16.
+    with torch.no_grad():
+        for name, param in module.named_parameters():
+            if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32:
+                param.data = param.data.float()
+
+
+class Rotary(nn.Module):
+    # Caches cos/sin tables per sequence length on the current device.
+    def __init__(self, dim: int, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached: Tensor | None = None
+        self._sin_cached: Tensor | None = None
+
+    def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]:
+        if (
+            self._cos_cached is None
+            or self._sin_cached is None
+            or self._seq_len_cached != seq_len
+            or self._cos_cached.device != device
+        ):
+            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+            freqs = torch.outer(t, self.inv_freq.to(device))
+            self._cos_cached = freqs.cos()[None, None, :, :]
+            self._sin_cached = freqs.sin()[None, None, :, :]
+            self._seq_len_cached = seq_len
+        return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
+
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+
+class CausalSelfAttention(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if dim % num_heads != 0:
+            raise ValueError("model_dim must be divisible by num_heads")
+        if num_heads % num_kv_heads != 0:
+            raise ValueError("num_heads must be divisible by num_kv_heads")
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = dim // num_heads
+        if self.head_dim % 2 != 0:
+            raise ValueError("head_dim must be even for RoPE")
+        kv_dim = self.num_kv_heads * self.head_dim
+        self.c_q = CastedLinear(dim, dim, bias=False)
+        self.c_k = CastedLinear(dim, kv_dim, bias=False)
+        self.c_v = CastedLinear(dim, kv_dim, bias=False)
+        self.proj = CastedLinear(dim, dim, bias=False)
+        self.proj._zero_init = True
+        self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+        self.rotary = Rotary(self.head_dim, base=rope_base)
+
+    def forward(self, x: Tensor) -> Tensor:
+        bsz, seqlen, dim = x.shape
+        q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
+        k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        q = F.rms_norm(q, (q.size(-1),))
+        k = F.rms_norm(k, (k.size(-1),))
+        cos, sin = self.rotary(seqlen, x.device, q.dtype)
+        q = apply_rotary_emb(q, cos, sin)
+        k = apply_rotary_emb(k, cos, sin)
+        q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None]
+        y = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=None,
+            is_causal=True,
+            enable_gqa=(self.num_kv_heads != self.num_heads),
+        )
+        y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim)
+        return self.proj(y)
+
+
+class MLP(nn.Module):
+    # relu^2 MLP from the original modded-nanogpt setup
+    def __init__(self, dim: int, mlp_mult: int):
+        super().__init__()
+        hidden = mlp_mult * dim
+        self.fc = CastedLinear(dim, hidden, bias=False)
+        self.proj = CastedLinear(hidden, dim, bias=False)
+        self.proj._zero_init = True
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = torch.relu(self.fc(x))
+        return self.proj(x.square())
+
+
+class Block(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init)
+        self.mlp = MLP(dim, mlp_mult)
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x))
+        x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out
+        x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x))
+        return x
+
+
+class GPT(nn.Module):
+    def __init__(
+        self,
+        vocab_size: int,
+        num_layers: int,
+        model_dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        tie_embeddings: bool,
+        tied_embed_init_std: float,
+        logit_softcap: float,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if logit_softcap <= 0.0:
+            raise ValueError(f"logit_softcap must be positive, got {logit_softcap}")
+        self.tie_embeddings = tie_embeddings
+        self.tied_embed_init_std = tied_embed_init_std
+        self.logit_softcap = logit_softcap
+        self.tok_emb = nn.Embedding(vocab_size, model_dim)
+        self.num_encoder_layers = num_layers // 2
+        self.num_decoder_layers = num_layers - self.num_encoder_layers
+        self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+        self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
+        self.blocks = nn.ModuleList(
+            [
+                Block(
+                    model_dim,
+                    num_heads,
+                    num_kv_heads,
+                    mlp_mult,
+                    rope_base,
+                    qk_gain_init,
+                )
+                for i in range(num_layers)
+            ]
+        )
+        self.final_norm = RMSNorm()
+        self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False)
+        if self.lm_head is not None:
+            self.lm_head._zero_init = True
+        self._init_weights()
+
+    def _init_weights(self) -> None:
+        if self.tie_embeddings:
+            nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std)
+        for module in self.modules():
+            if isinstance(module, nn.Linear) and getattr(module, "_zero_init", False):
+                nn.init.zeros_(module.weight)
+
+    def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor:
+        x = self.tok_emb(input_ids)
+        x = F.rms_norm(x, (x.size(-1),))
+        x0 = x
+        skips: list[Tensor] = []
+
+        # First half stores skips; second half reuses them in reverse order.
+        for i in range(self.num_encoder_layers):
+            x = self.blocks[i](x, x0)
+            skips.append(x)
+        for i in range(self.num_decoder_layers):
+            if skips:
+                x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop()
+            x = self.blocks[self.num_encoder_layers + i](x, x0)
+
+        x = self.final_norm(x).reshape(-1, x.size(-1))
+        targets = target_ids.reshape(-1)
+        if self.tie_embeddings:
+            logits_proj = F.linear(x, self.tok_emb.weight)
+        else:
+            if self.lm_head is None:
+                raise RuntimeError("lm_head is required when tie_embeddings=False")
+            logits_proj = self.lm_head(x)
+        logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)
+        return F.cross_entropy(logits.float(), targets, reduction="mean")
+
+
+# -----------------------------
+# TRAINING
+# -----------------------------
+
+def main() -> None:
+    global zeropower_via_newtonschulz5
+
+    code = Path(__file__).read_text(encoding="utf-8")
+    args = Hyperparameters()
+    zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5)
+
+    # -----------------------------
+    # DISTRIBUTED + CUDA SETUP
+    # -----------------------------
+
+    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+    rank = int(os.environ.get("RANK", "0"))
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    if world_size <= 0:
+        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
+    if 8 % world_size != 0:
+        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
+    grad_accum_steps = 8 // world_size
+    grad_scale = 1.0 / grad_accum_steps
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required")
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+    master_process = rank == 0
+
+    # Fast math knobs
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
+
+    enable_cudnn_sdp(False)
+    enable_flash_sdp(True)
+    enable_mem_efficient_sdp(False)
+    enable_math_sdp(False)
+
+    logfile = None
+    if master_process:
+        os.makedirs("logs", exist_ok=True)
+        logfile = f"logs/{args.run_id}.txt"
+        print(logfile)
+
+    def log0(msg: str, console: bool = True) -> None:
+        if not master_process:
+            return
+        if console:
+            print(msg)
+        if logfile is not None:
+            with open(logfile, "a", encoding="utf-8") as f:
+                print(msg, file=f)
+
+    log0(code, console=False)
+    log0("=" * 100, console=False)
+    log0(f"Running Python {sys.version}", console=False)
+    log0(f"Running PyTorch {torch.__version__}", console=False)
+    log0(
+        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout,
+        console=False,
+    )
+    log0("=" * 100, console=False)
+
+    # -----------------------------
+    # TOKENIZER + VALIDATION METRIC SETUP
+    # -----------------------------
+
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    torch.cuda.manual_seed_all(args.seed)
+
+    if not args.tokenizer_path.endswith(".model"):
+        raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}")
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    if int(sp.vocab_size()) != args.vocab_size:
+        raise ValueError(
+            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
+        )
+    dataset_dir = Path(args.data_path).resolve()
+    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
+    val_tokens = load_validation_tokens(args.val_files, args.train_seq_len)
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
+        sp, args.vocab_size, device
+    )
+    log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
+    log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}")
+    log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}")
+
+    # -----------------------------
+    # MODEL + OPTIMIZER SETUP
+    # -----------------------------
+
+    base_model = GPT(
+        vocab_size=args.vocab_size,
+        num_layers=args.num_layers,
+        model_dim=args.model_dim,
+        num_heads=args.num_heads,
+        num_kv_heads=args.num_kv_heads,
+        mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings,
+        tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap,
+        rope_base=args.rope_base,
+        qk_gain_init=args.qk_gain_init,
+    ).to(device).bfloat16()
+    for module in base_model.modules():
+        if isinstance(module, CastedLinear):
+            module.float()
+    restore_low_dim_params_to_fp32(base_model)
+    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
+    model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model
+
+    # Optimizer split:
+    # - token embedding (Adam) uses EMBED_LR
+    # - untied lm_head (Adam) uses HEAD_LR
+    # - matrix params in transformer blocks use MATRIX_LR via Muon
+    # - vectors/scalars use SCALAR_LR via Adam
+    block_named_params = list(base_model.blocks.named_parameters())
+    matrix_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    scalar_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    if base_model.skip_weights.numel() > 0:
+        scalar_params.append(base_model.skip_weights)
+    token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
+    optimizer_tok = torch.optim.Adam(
+        [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizer_muon = Muon(
+        matrix_params,
+        lr=args.matrix_lr,
+        momentum=args.muon_momentum,
+        backend_steps=args.muon_backend_steps,
+    )
+    for group in optimizer_muon.param_groups:
+        group["base_lr"] = args.matrix_lr
+    optimizer_scalar = torch.optim.Adam(
+        [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar]
+    if base_model.lm_head is not None:
+        optimizer_head = torch.optim.Adam(
+            [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}],
+            betas=(args.beta1, args.beta2),
+            eps=args.adam_eps,
+            fused=True,
+        )
+        optimizers.insert(1, optimizer_head)
+
+    n_params = sum(p.numel() for p in base_model.parameters())
+    log0(f"model_params:{n_params}")
+    log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
+    log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
+    log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
+    log0(
+        f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} "
+        f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} "
+        f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}"
+    )
+    log0(
+        f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} "
+        f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} "
+        f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}"
+    )
+    log0(f"seed:{args.seed}")
+
+    # -----------------------------
+    # DATA LOADER & MODEL WARMUP
+    # -----------------------------
+
+    train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    def zero_grad_all() -> None:
+        for opt in optimizers:
+            opt.zero_grad(set_to_none=True)
+
+    max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
+
+    def lr_mul(step: int, elapsed_ms: float) -> float:
+        if args.warmdown_iters <= 0:
+            return 1.0
+        if max_wallclock_ms is None:
+            warmdown_start = max(args.iterations - args.warmdown_iters, 0)
+            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0
+        step_ms = elapsed_ms / max(step, 1)
+        warmdown_ms = args.warmdown_iters * step_ms
+        remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
+        return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
+
+    # Warmup primes the compiled forward/backward/optimizer paths, then we restore the
+    # initial weights/optimizer state so measured training starts from the true init.
+    if args.warmup_steps > 0:
+        initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()}
+        initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers]
+        model.train()
+        for warmup_step in range(args.warmup_steps):
+            zero_grad_all()
+            for micro_step in range(grad_accum_steps):
+                if distributed:
+                    model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+                x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+                with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                    warmup_loss = model(x, y)
+                (warmup_loss * grad_scale).backward()
+            for opt in optimizers:
+                opt.step()
+            zero_grad_all()
+            if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps:
+                log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}")
+        base_model.load_state_dict(initial_model_state, strict=True)
+        for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
+            opt.load_state_dict(state)
+        zero_grad_all()
+        if distributed:
+            model.require_backward_grad_sync = True
+        train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    # -----------------------------
+    # MAIN TRAINING LOOP
+    # -----------------------------
+
+    training_time_ms = 0.0
+    stop_after_step: int | None = None
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+
+    step = 0
+    while True:
+        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+
+        should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0)
+        if should_validate:
+            torch.cuda.synchronize()
+            training_time_ms += 1000.0 * (time.perf_counter() - t0)
+            val_loss, val_bpb = eval_val(
+                args,
+                model,
+                rank,
+                world_size,
+                device,
+                grad_accum_steps,
+                val_tokens,
+                base_bytes_lut,
+                has_leading_space_lut,
+                is_boundary_token_lut,
+            )
+            log0(
+                f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+                f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms"
+            )
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+
+        if last_step:
+            if stop_after_step is not None and step < args.iterations:
+                log0(
+                    f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms "
+                    f"step:{step}/{args.iterations}"
+                )
+            break
+
+        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        scale = lr_mul(step, elapsed_ms)
+        zero_grad_all()
+        train_loss = torch.zeros((), device=device)
+        for micro_step in range(grad_accum_steps):
+            if distributed:
+                model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+            x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                loss = model(x, y)
+            train_loss += loss.detach()
+            (loss * grad_scale).backward()
+        train_loss /= grad_accum_steps
+
+        frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+        muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+        for group in optimizer_muon.param_groups:
+            group["momentum"] = muon_momentum
+
+        for opt in optimizers:
+            for group in opt.param_groups:
+                group["lr"] = group["base_lr"] * scale
+
+        if args.grad_clip_norm > 0:
+            torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm)
+        for opt in optimizers:
+            opt.step()
+        zero_grad_all()
+
+        step += 1
+        approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        should_log_train = (
+            args.train_log_every > 0
+            and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None)
+        )
+        if should_log_train:
+            log0(
+                f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} "
+                f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms"
+            )
+
+        # Needed to sync whether we've reached the wallclock cap.
+        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
+        if distributed and max_wallclock_ms is not None:
+            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
+            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
+            reached_cap = bool(reached_cap_tensor.item())
+        if stop_after_step is None and reached_cap:
+            stop_after_step = step
+
+    log0(
+        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
+        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
+
+    # -----------------------------
+    # SERIALIZATION + ROUNDTRIP VALIDATION
+    # -----------------------------
+    # Save the raw state (useful for debugging/loading in PyTorch directly), then always produce
+    # the compressed int8+zlib artifact and validate the round-tripped weights.
+
+    if master_process:
+        torch.save(base_model.state_dict(), "final_model.pt")
+        model_bytes = os.path.getsize("final_model.pt")
+        code_bytes = len(code.encode("utf-8"))
+        log0(f"Serialized model: {model_bytes} bytes")
+        log0(f"Code size: {code_bytes} bytes")
+        log0(f"Total submission size: {model_bytes + code_bytes} bytes")
+
+    quant_obj, quant_stats = quantize_state_dict_int8(base_model.state_dict())
+    quant_buf = io.BytesIO()
+    torch.save(quant_obj, quant_buf)
+    quant_raw = quant_buf.getvalue()
+    quant_blob = zlib.compress(quant_raw, level=9)
+    quant_raw_bytes = len(quant_raw)
+    if master_process:
+        with open("final_model.int8.ptz", "wb") as f:
+            f.write(quant_blob)
+        quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+        code_bytes = len(code.encode("utf-8"))
+        ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
+        log0(
+            f"Serialized model int8+zlib: {quant_file_bytes} bytes "
+            f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
+        )
+        log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+
+    if distributed:
+        dist.barrier()
+    with open("final_model.int8.ptz", "rb") as f:
+        quant_blob_disk = f.read()
+    quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu")
+    base_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True)
+    torch.cuda.synchronize()
+    t_qeval = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args,
+        model,
+        rank,
+        world_size,
+        device,
+        grad_accum_steps,
+        val_tokens,
+        base_bytes_lut,
+        has_leading_space_lut,
+        is_boundary_token_lut,
+    )
+    torch.cuda.synchronize()
+    log0(
+        f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+        f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+    )
+    log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+
+    if distributed:
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
+
+====================================================================================================
+Running Python 3.12.9 (main, Jan 22 2026, 18:37:37) [GCC 13.3.0]
+Running PyTorch 2.10.0+cu128
+Wed Mar 18 07:51:20 2026       
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 13.0     |
+|-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA H100 80GB HBM3          On  |   00000001:00:00.0 Off |                    0 |
+| N/A   31C    P0            119W /  700W |    1774MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   1  NVIDIA H100 80GB HBM3          On  |   00000002:00:00.0 Off |                    0 |
+| N/A   31C    P0            120W /  700W |    1822MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   2  NVIDIA H100 80GB HBM3          On  |   00000003:00:00.0 Off |                    0 |
+| N/A   30C    P0            117W /  700W |    1822MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   3  NVIDIA H100 80GB HBM3          On  |   00000008:00:00.0 Off |                    0 |
+| N/A   30C    P0            119W /  700W |    1822MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   4  NVIDIA H100 80GB HBM3          On  |   00000009:00:00.0 Off |                    0 |
+| N/A   29C    P0            120W /  700W |    1822MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   5  NVIDIA H100 80GB HBM3          On  |   0000000A:00:00.0 Off |                    0 |
+| N/A   31C    P0            119W /  700W |    1822MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   6  NVIDIA H100 80GB HBM3          On  |   0000000B:00:00.0 Off |                    0 |
+| N/A   30C    P0            116W /  700W |    1822MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   7  NVIDIA H100 80GB HBM3          On  |   0000000C:00:00.0 Off |                    0 |
+| N/A   30C    P0            118W /  700W |    1582MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+                                                                                         
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|    0   N/A  N/A    987032      C   ...t/.pyenv/versions/3.12.9/bin/python       1764MiB |
+|    1   N/A  N/A    987033      C   ...t/.pyenv/versions/3.12.9/bin/python       1812MiB |
+|    2   N/A  N/A    987034      C   ...t/.pyenv/versions/3.12.9/bin/python       1812MiB |
+|    3   N/A  N/A    987035      C   ...t/.pyenv/versions/3.12.9/bin/python       1812MiB |
+|    4   N/A  N/A    987036      C   ...t/.pyenv/versions/3.12.9/bin/python       1812MiB |
+|    5   N/A  N/A    987037      C   ...t/.pyenv/versions/3.12.9/bin/python       1812MiB |
+|    6   N/A  N/A    987038      C   ...t/.pyenv/versions/3.12.9/bin/python       1812MiB |
+|    7   N/A  N/A    987039      C   ...t/.pyenv/versions/3.12.9/bin/python       1572MiB |
++-----------------------------------------------------------------------------------------+
+
+====================================================================================================
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/tmp/fineweb_quasi10Bfrom50B_50keval_sp1024_v0/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:195
+val_loader:shards pattern=/tmp/fineweb_quasi10Bfrom50B_50keval_sp1024_v0/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
+model_params:17059912
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
+train_batch_tokens:524288 train_seq_len:1024 iterations:500000 warmup_steps:20 max_wallclock_seconds:14400.000
+seed:1337
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/500000 val_loss:6.9357 val_bpb:4.1077 train_time:0ms step_avg:0.02ms
+step:1/500000 train_loss:6.9370 train_time:33ms step_avg:32.84ms
+step:2/500000 train_loss:16.8366 train_time:77ms step_avg:38.37ms
+step:3/500000 train_loss:8.7611 train_time:120ms step_avg:39.91ms
+step:4/500000 train_loss:6.6384 train_time:162ms step_avg:40.39ms
+step:5/500000 train_loss:6.6118 train_time:206ms step_avg:41.14ms
+step:6/500000 train_loss:7.4217 train_time:249ms step_avg:41.44ms
+step:7/500000 train_loss:6.3501 train_time:292ms step_avg:41.72ms
+step:8/500000 train_loss:6.1581 train_time:336ms step_avg:41.97ms
+step:9/500000 train_loss:6.0678 train_time:378ms step_avg:42.02ms
+step:10/500000 train_loss:5.9746 train_time:422ms step_avg:42.15ms
+step:200/500000 train_loss:2.8504 train_time:8738ms step_avg:43.69ms
+step:400/500000 train_loss:2.3638 train_time:17484ms step_avg:43.71ms
+step:600/500000 train_loss:2.5453 train_time:26216ms step_avg:43.69ms
+step:800/500000 train_loss:2.2913 train_time:34960ms step_avg:43.70ms
+step:1000/500000 train_loss:2.3707 train_time:43708ms step_avg:43.71ms
+step:1200/500000 train_loss:2.3889 train_time:52441ms step_avg:43.70ms
+step:1400/500000 train_loss:2.4331 train_time:61180ms step_avg:43.70ms
+step:1600/500000 train_loss:2.0960 train_time:69912ms step_avg:43.70ms
+step:1800/500000 train_loss:2.1998 train_time:78649ms step_avg:43.69ms
+step:2000/500000 train_loss:2.2506 train_time:87392ms step_avg:43.70ms
+step:2200/500000 train_loss:2.0768 train_time:96125ms step_avg:43.69ms
+step:2400/500000 train_loss:2.2012 train_time:104860ms step_avg:43.69ms
+step:2600/500000 train_loss:2.4155 train_time:113597ms step_avg:43.69ms
+step:2800/500000 train_loss:2.2353 train_time:122349ms step_avg:43.70ms
+step:3000/500000 train_loss:2.2275 train_time:131087ms step_avg:43.70ms
+step:3200/500000 train_loss:2.1879 train_time:139825ms step_avg:43.70ms
+step:3400/500000 train_loss:2.1611 train_time:148561ms step_avg:43.69ms
+step:3600/500000 train_loss:2.1179 train_time:157297ms step_avg:43.69ms
+step:3800/500000 train_loss:2.2211 train_time:166018ms step_avg:43.69ms
+step:4000/500000 train_loss:2.1627 train_time:174758ms step_avg:43.69ms
+step:4200/500000 train_loss:2.1757 train_time:183623ms step_avg:43.72ms
+step:4400/500000 train_loss:2.1143 train_time:192357ms step_avg:43.72ms
+step:4600/500000 train_loss:1.9730 train_time:201085ms step_avg:43.71ms
+step:4800/500000 train_loss:2.2618 train_time:209818ms step_avg:43.71ms
+step:5000/500000 train_loss:2.0275 train_time:218548ms step_avg:43.71ms
+step:5200/500000 train_loss:2.1723 train_time:227276ms step_avg:43.71ms
+step:5400/500000 train_loss:2.1842 train_time:235999ms step_avg:43.70ms
+step:5600/500000 train_loss:2.1841 train_time:244724ms step_avg:43.70ms
+step:5800/500000 train_loss:2.1435 train_time:253451ms step_avg:43.70ms
+step:6000/500000 train_loss:2.2229 train_time:262182ms step_avg:43.70ms
+step:6200/500000 train_loss:2.0860 train_time:270912ms step_avg:43.70ms
+step:6400/500000 train_loss:2.1598 train_time:279652ms step_avg:43.70ms
+step:6600/500000 train_loss:2.1228 train_time:288379ms step_avg:43.69ms
+step:6800/500000 train_loss:2.1890 train_time:297103ms step_avg:43.69ms
+step:7000/500000 train_loss:2.2279 train_time:305832ms step_avg:43.69ms
+step:7200/500000 train_loss:2.1999 train_time:314562ms step_avg:43.69ms
+step:7400/500000 train_loss:2.1173 train_time:323296ms step_avg:43.69ms
+step:7600/500000 train_loss:1.9995 train_time:332032ms step_avg:43.69ms
+step:7800/500000 train_loss:2.1500 train_time:340763ms step_avg:43.69ms
+step:8000/500000 train_loss:2.1150 train_time:349492ms step_avg:43.69ms
+step:8200/500000 train_loss:2.1884 train_time:358222ms step_avg:43.69ms
+step:8400/500000 train_loss:2.1346 train_time:367069ms step_avg:43.70ms
+step:8600/500000 train_loss:2.1364 train_time:375806ms step_avg:43.70ms
+step:8800/500000 train_loss:2.0998 train_time:384530ms step_avg:43.70ms
+step:9000/500000 train_loss:2.0268 train_time:393259ms step_avg:43.70ms
+step:9200/500000 train_loss:2.0837 train_time:401988ms step_avg:43.69ms
+step:9400/500000 train_loss:2.1327 train_time:410718ms step_avg:43.69ms
+step:9600/500000 train_loss:2.1452 train_time:419449ms step_avg:43.69ms
+step:9800/500000 train_loss:2.0746 train_time:428188ms step_avg:43.69ms
+step:10000/500000 train_loss:2.1114 train_time:436922ms step_avg:43.69ms
+step:10200/500000 train_loss:2.0700 train_time:445656ms step_avg:43.69ms
+step:10400/500000 train_loss:2.0994 train_time:454390ms step_avg:43.69ms
+step:10600/500000 train_loss:1.9771 train_time:463127ms step_avg:43.69ms
+step:10800/500000 train_loss:2.1884 train_time:471870ms step_avg:43.69ms
+step:11000/500000 train_loss:2.1136 train_time:480607ms step_avg:43.69ms
+step:11200/500000 train_loss:2.0714 train_time:489337ms step_avg:43.69ms
+step:11400/500000 train_loss:2.0570 train_time:498065ms step_avg:43.69ms
+step:11600/500000 train_loss:2.0643 train_time:506798ms step_avg:43.69ms
+step:11800/500000 train_loss:2.0966 train_time:515533ms step_avg:43.69ms
+step:12000/500000 train_loss:2.0734 train_time:524265ms step_avg:43.69ms
+step:12200/500000 train_loss:2.2214 train_time:532995ms step_avg:43.69ms
+step:12400/500000 train_loss:1.8632 train_time:541847ms step_avg:43.70ms
+step:12600/500000 train_loss:2.0944 train_time:550591ms step_avg:43.70ms
+step:12800/500000 train_loss:2.1093 train_time:559322ms step_avg:43.70ms
+step:13000/500000 train_loss:2.1906 train_time:568053ms step_avg:43.70ms
+step:13200/500000 train_loss:2.2034 train_time:576786ms step_avg:43.70ms
+step:13400/500000 train_loss:2.0818 train_time:585518ms step_avg:43.70ms
+step:13600/500000 train_loss:1.9545 train_time:594253ms step_avg:43.70ms
+step:13800/500000 train_loss:2.0350 train_time:602978ms step_avg:43.69ms
+step:14000/500000 train_loss:2.0960 train_time:611714ms step_avg:43.69ms
+step:14200/500000 train_loss:2.1828 train_time:620450ms step_avg:43.69ms
+step:14400/500000 train_loss:2.0819 train_time:629189ms step_avg:43.69ms
+step:14600/500000 train_loss:2.1389 train_time:637923ms step_avg:43.69ms
+step:14800/500000 train_loss:1.9175 train_time:646657ms step_avg:43.69ms
+step:15000/500000 train_loss:2.0332 train_time:655403ms step_avg:43.69ms
+step:15200/500000 train_loss:2.1406 train_time:664139ms step_avg:43.69ms
+step:15400/500000 train_loss:2.0916 train_time:672882ms step_avg:43.69ms
+step:15600/500000 train_loss:2.1547 train_time:681620ms step_avg:43.69ms
+step:15800/500000 train_loss:1.8204 train_time:690352ms step_avg:43.69ms
+step:16000/500000 train_loss:2.1014 train_time:699084ms step_avg:43.69ms
+step:16200/500000 train_loss:2.1821 train_time:707820ms step_avg:43.69ms
+step:16400/500000 train_loss:2.1894 train_time:716552ms step_avg:43.69ms
+step:16600/500000 train_loss:2.0465 train_time:725412ms step_avg:43.70ms
+step:16800/500000 train_loss:1.9392 train_time:734145ms step_avg:43.70ms
+step:17000/500000 train_loss:1.9603 train_time:742889ms step_avg:43.70ms
+step:17200/500000 train_loss:2.0424 train_time:751641ms step_avg:43.70ms
+step:17400/500000 train_loss:2.0173 train_time:760388ms step_avg:43.70ms
+step:17600/500000 train_loss:2.0272 train_time:769119ms step_avg:43.70ms
+step:17800/500000 train_loss:2.0577 train_time:777863ms step_avg:43.70ms
+step:18000/500000 train_loss:2.1233 train_time:786590ms step_avg:43.70ms
+step:18200/500000 train_loss:2.2190 train_time:795310ms step_avg:43.70ms
+step:18400/500000 train_loss:2.0528 train_time:804037ms step_avg:43.70ms
+step:18600/500000 train_loss:2.0409 train_time:812766ms step_avg:43.70ms
+step:18800/500000 train_loss:2.1861 train_time:821488ms step_avg:43.70ms
+step:19000/500000 train_loss:2.0094 train_time:830217ms step_avg:43.70ms
+step:19200/500000 train_loss:2.1793 train_time:838942ms step_avg:43.69ms
+step:19400/500000 train_loss:2.1423 train_time:847676ms step_avg:43.69ms
+step:19600/500000 train_loss:1.9673 train_time:856408ms step_avg:43.69ms
+step:19800/500000 train_loss:2.2609 train_time:865135ms step_avg:43.69ms
+step:20000/500000 train_loss:1.9800 train_time:873911ms step_avg:43.70ms
+step:20000/500000 val_loss:2.0779 val_bpb:1.2307 train_time:873923ms step_avg:43.70ms
+step:20200/500000 train_loss:2.0109 train_time:882650ms step_avg:43.70ms
+step:20400/500000 train_loss:2.0714 train_time:891387ms step_avg:43.70ms
+step:20600/500000 train_loss:2.0559 train_time:900265ms step_avg:43.70ms
+step:20800/500000 train_loss:2.0586 train_time:909008ms step_avg:43.70ms
+step:21000/500000 train_loss:2.0829 train_time:917731ms step_avg:43.70ms
+step:21200/500000 train_loss:2.0840 train_time:926484ms step_avg:43.70ms
+step:21400/500000 train_loss:2.0728 train_time:935227ms step_avg:43.70ms
+step:21600/500000 train_loss:1.9781 train_time:943962ms step_avg:43.70ms
+step:21800/500000 train_loss:2.1248 train_time:952707ms step_avg:43.70ms
+step:22000/500000 train_loss:2.0181 train_time:961450ms step_avg:43.70ms
+step:22200/500000 train_loss:1.9460 train_time:970185ms step_avg:43.70ms
+step:22400/500000 train_loss:2.1037 train_time:978935ms step_avg:43.70ms
+step:22600/500000 train_loss:2.0755 train_time:987675ms step_avg:43.70ms
+step:22800/500000 train_loss:2.0061 train_time:996415ms step_avg:43.70ms
+step:23000/500000 train_loss:2.1035 train_time:1005151ms step_avg:43.70ms
+step:23200/500000 train_loss:2.0248 train_time:1013891ms step_avg:43.70ms
+step:23400/500000 train_loss:2.0755 train_time:1022624ms step_avg:43.70ms
+step:23600/500000 train_loss:2.0633 train_time:1031363ms step_avg:43.70ms
+step:23800/500000 train_loss:2.0454 train_time:1040107ms step_avg:43.70ms
+step:24000/500000 train_loss:2.0450 train_time:1048848ms step_avg:43.70ms
+step:24200/500000 train_loss:2.1111 train_time:1057577ms step_avg:43.70ms
+step:24400/500000 train_loss:2.2038 train_time:1066306ms step_avg:43.70ms
+step:24600/500000 train_loss:2.1040 train_time:1075040ms step_avg:43.70ms
+step:24800/500000 train_loss:2.0724 train_time:1083908ms step_avg:43.71ms
+step:25000/500000 train_loss:2.0719 train_time:1092641ms step_avg:43.71ms
+step:25200/500000 train_loss:2.0243 train_time:1101373ms step_avg:43.71ms
+step:25400/500000 train_loss:1.9389 train_time:1110103ms step_avg:43.70ms
+step:25600/500000 train_loss:1.9308 train_time:1118832ms step_avg:43.70ms
+step:25800/500000 train_loss:2.0562 train_time:1127565ms step_avg:43.70ms
+step:26000/500000 train_loss:2.1770 train_time:1136291ms step_avg:43.70ms
+step:26200/500000 train_loss:2.0391 train_time:1145021ms step_avg:43.70ms
+step:26400/500000 train_loss:1.9735 train_time:1153756ms step_avg:43.70ms
+step:26600/500000 train_loss:2.0604 train_time:1162477ms step_avg:43.70ms
+step:26800/500000 train_loss:1.9815 train_time:1171216ms step_avg:43.70ms
+step:27000/500000 train_loss:2.0145 train_time:1179957ms step_avg:43.70ms
+step:27200/500000 train_loss:2.0304 train_time:1188694ms step_avg:43.70ms
+step:27400/500000 train_loss:2.0649 train_time:1197422ms step_avg:43.70ms
+step:27600/500000 train_loss:2.0879 train_time:1206157ms step_avg:43.70ms
+step:27800/500000 train_loss:2.2113 train_time:1214902ms step_avg:43.70ms
+step:28000/500000 train_loss:2.1648 train_time:1223631ms step_avg:43.70ms
+step:28200/500000 train_loss:2.0987 train_time:1232368ms step_avg:43.70ms
+step:28400/500000 train_loss:2.1582 train_time:1241115ms step_avg:43.70ms
+step:28600/500000 train_loss:2.0779 train_time:1249862ms step_avg:43.70ms
+step:28800/500000 train_loss:2.0425 train_time:1258604ms step_avg:43.70ms
+step:29000/500000 train_loss:2.1468 train_time:1267458ms step_avg:43.71ms
+step:29200/500000 train_loss:2.1703 train_time:1276186ms step_avg:43.70ms
+step:29400/500000 train_loss:2.0395 train_time:1284915ms step_avg:43.70ms
+step:29600/500000 train_loss:2.0624 train_time:1293655ms step_avg:43.70ms
+step:29800/500000 train_loss:2.0755 train_time:1302387ms step_avg:43.70ms
+step:30000/500000 train_loss:2.0016 train_time:1311121ms step_avg:43.70ms
+step:30200/500000 train_loss:2.0520 train_time:1319857ms step_avg:43.70ms
+step:30400/500000 train_loss:1.9604 train_time:1328595ms step_avg:43.70ms
+step:30600/500000 train_loss:2.0195 train_time:1337334ms step_avg:43.70ms
+step:30800/500000 train_loss:2.0386 train_time:1346076ms step_avg:43.70ms
+step:31000/500000 train_loss:2.1222 train_time:1354809ms step_avg:43.70ms
+step:31200/500000 train_loss:2.1388 train_time:1363547ms step_avg:43.70ms
+step:31400/500000 train_loss:2.1097 train_time:1372279ms step_avg:43.70ms
+step:31600/500000 train_loss:2.0476 train_time:1381026ms step_avg:43.70ms
+step:31800/500000 train_loss:2.0700 train_time:1389780ms step_avg:43.70ms
+step:32000/500000 train_loss:2.1175 train_time:1398523ms step_avg:43.70ms
+step:32200/500000 train_loss:2.0473 train_time:1407258ms step_avg:43.70ms
+step:32400/500000 train_loss:1.9902 train_time:1416002ms step_avg:43.70ms
+step:32600/500000 train_loss:2.0894 train_time:1424736ms step_avg:43.70ms
+step:32800/500000 train_loss:2.1487 train_time:1433470ms step_avg:43.70ms
+step:33000/500000 train_loss:2.0049 train_time:1442332ms step_avg:43.71ms
+step:33200/500000 train_loss:2.0618 train_time:1451061ms step_avg:43.71ms
+step:33400/500000 train_loss:1.9522 train_time:1459784ms step_avg:43.71ms
+step:33600/500000 train_loss:2.1620 train_time:1468519ms step_avg:43.71ms
+step:33800/500000 train_loss:2.0664 train_time:1477245ms step_avg:43.71ms
+step:34000/500000 train_loss:2.3050 train_time:1485969ms step_avg:43.70ms
+step:34200/500000 train_loss:2.1240 train_time:1494699ms step_avg:43.70ms
+step:34400/500000 train_loss:2.0971 train_time:1503428ms step_avg:43.70ms
+step:34600/500000 train_loss:2.1775 train_time:1512165ms step_avg:43.70ms
+step:34800/500000 train_loss:2.0770 train_time:1520903ms step_avg:43.70ms
+step:35000/500000 train_loss:2.0156 train_time:1529639ms step_avg:43.70ms
+step:35200/500000 train_loss:2.0252 train_time:1538380ms step_avg:43.70ms
+step:35400/500000 train_loss:2.0857 train_time:1547117ms step_avg:43.70ms
+step:35600/500000 train_loss:2.2012 train_time:1555857ms step_avg:43.70ms
+step:35800/500000 train_loss:2.0054 train_time:1564595ms step_avg:43.70ms
+step:36000/500000 train_loss:2.1126 train_time:1573347ms step_avg:43.70ms
+step:36200/500000 train_loss:1.9196 train_time:1582070ms step_avg:43.70ms
+step:36400/500000 train_loss:2.0816 train_time:1590801ms step_avg:43.70ms
+step:36600/500000 train_loss:2.0577 train_time:1599527ms step_avg:43.70ms
+step:36800/500000 train_loss:1.9859 train_time:1608254ms step_avg:43.70ms
+step:37000/500000 train_loss:1.9921 train_time:1616984ms step_avg:43.70ms
+step:37200/500000 train_loss:2.0478 train_time:1625821ms step_avg:43.70ms
+step:37400/500000 train_loss:2.0508 train_time:1634571ms step_avg:43.71ms
+step:37600/500000 train_loss:2.0218 train_time:1643314ms step_avg:43.71ms
+step:37800/500000 train_loss:2.1115 train_time:1652053ms step_avg:43.71ms
+step:38000/500000 train_loss:1.9950 train_time:1660788ms step_avg:43.70ms
+step:38200/500000 train_loss:2.1599 train_time:1669527ms step_avg:43.70ms
+step:38400/500000 train_loss:2.0854 train_time:1678274ms step_avg:43.71ms
+step:38600/500000 train_loss:2.0777 train_time:1687012ms step_avg:43.70ms
+step:38800/500000 train_loss:2.0381 train_time:1695754ms step_avg:43.71ms
+step:39000/500000 train_loss:1.9603 train_time:1704494ms step_avg:43.70ms
+step:39200/500000 train_loss:2.0654 train_time:1713230ms step_avg:43.70ms
+step:39400/500000 train_loss:2.0960 train_time:1721966ms step_avg:43.70ms
+step:39600/500000 train_loss:2.1160 train_time:1730705ms step_avg:43.70ms
+step:39800/500000 train_loss:2.0706 train_time:1739449ms step_avg:43.70ms
+step:40000/500000 train_loss:1.9931 train_time:1748179ms step_avg:43.70ms
+step:40000/500000 val_loss:2.0513 val_bpb:1.2149 train_time:1748195ms step_avg:43.70ms
+step:40200/500000 train_loss:2.0616 train_time:1757025ms step_avg:43.71ms
+step:40400/500000 train_loss:2.1068 train_time:1765759ms step_avg:43.71ms
+step:40600/500000 train_loss:2.1224 train_time:1774490ms step_avg:43.71ms
+step:40800/500000 train_loss:1.9901 train_time:1783221ms step_avg:43.71ms
+step:41000/500000 train_loss:2.0482 train_time:1791956ms step_avg:43.71ms
+step:41200/500000 train_loss:2.0054 train_time:1800685ms step_avg:43.71ms
+step:41400/500000 train_loss:2.1389 train_time:1809420ms step_avg:43.71ms
+step:41600/500000 train_loss:1.9084 train_time:1818157ms step_avg:43.71ms
+step:41800/500000 train_loss:1.9776 train_time:1826898ms step_avg:43.71ms
+step:42000/500000 train_loss:2.1240 train_time:1835629ms step_avg:43.71ms
+step:42200/500000 train_loss:2.0346 train_time:1844369ms step_avg:43.71ms
+step:42400/500000 train_loss:1.8743 train_time:1853109ms step_avg:43.71ms
+step:42600/500000 train_loss:2.0811 train_time:1861841ms step_avg:43.71ms
+step:42800/500000 train_loss:1.9275 train_time:1870573ms step_avg:43.70ms
+step:43000/500000 train_loss:2.0186 train_time:1879304ms step_avg:43.70ms
+step:43200/500000 train_loss:1.9877 train_time:1888040ms step_avg:43.70ms
+step:43400/500000 train_loss:2.0362 train_time:1896771ms step_avg:43.70ms
+step:43600/500000 train_loss:2.0231 train_time:1905503ms step_avg:43.70ms
+step:43800/500000 train_loss:2.0801 train_time:1914237ms step_avg:43.70ms
+step:44000/500000 train_loss:2.0663 train_time:1922975ms step_avg:43.70ms
+step:44200/500000 train_loss:2.0227 train_time:1931827ms step_avg:43.71ms
+step:44400/500000 train_loss:1.9966 train_time:1940564ms step_avg:43.71ms
+step:44600/500000 train_loss:2.0269 train_time:1949298ms step_avg:43.71ms
+step:44800/500000 train_loss:2.4033 train_time:1958041ms step_avg:43.71ms
+step:45000/500000 train_loss:2.1128 train_time:1966777ms step_avg:43.71ms
+step:45200/500000 train_loss:2.1673 train_time:1975520ms step_avg:43.71ms
+step:45400/500000 train_loss:2.1298 train_time:1984260ms step_avg:43.71ms
+step:45600/500000 train_loss:1.8693 train_time:1992998ms step_avg:43.71ms
+step:45800/500000 train_loss:2.0838 train_time:2001733ms step_avg:43.71ms
+step:46000/500000 train_loss:2.0543 train_time:2010471ms step_avg:43.71ms
+step:46200/500000 train_loss:2.1408 train_time:2019211ms step_avg:43.71ms
+step:46400/500000 train_loss:2.0154 train_time:2027949ms step_avg:43.71ms
+step:46600/500000 train_loss:2.0614 train_time:2036684ms step_avg:43.71ms
+step:46800/500000 train_loss:2.1562 train_time:2045414ms step_avg:43.71ms
+step:47000/500000 train_loss:2.0508 train_time:2054144ms step_avg:43.71ms
+step:47200/500000 train_loss:2.0838 train_time:2062882ms step_avg:43.71ms
+step:47400/500000 train_loss:2.0732 train_time:2071643ms step_avg:43.71ms
+step:47600/500000 train_loss:2.0627 train_time:2080393ms step_avg:43.71ms
+step:47800/500000 train_loss:2.0581 train_time:2089138ms step_avg:43.71ms
+step:48000/500000 train_loss:2.1821 train_time:2097874ms step_avg:43.71ms
+step:48200/500000 train_loss:2.0664 train_time:2106621ms step_avg:43.71ms
+step:48400/500000 train_loss:2.0724 train_time:2115483ms step_avg:43.71ms
+step:48600/500000 train_loss:1.9536 train_time:2124224ms step_avg:43.71ms
+step:48800/500000 train_loss:2.0122 train_time:2132956ms step_avg:43.71ms
+step:49000/500000 train_loss:2.1620 train_time:2141694ms step_avg:43.71ms
+step:49200/500000 train_loss:1.9946 train_time:2150431ms step_avg:43.71ms
+step:49400/500000 train_loss:2.0624 train_time:2159169ms step_avg:43.71ms
+step:49600/500000 train_loss:2.0001 train_time:2167906ms step_avg:43.71ms
+step:49800/500000 train_loss:2.0727 train_time:2176641ms step_avg:43.71ms
+step:50000/500000 train_loss:2.0876 train_time:2185379ms step_avg:43.71ms
+step:50200/500000 train_loss:2.0509 train_time:2194118ms step_avg:43.71ms
+step:50400/500000 train_loss:2.1277 train_time:2202852ms step_avg:43.71ms
+step:50600/500000 train_loss:2.0534 train_time:2211582ms step_avg:43.71ms
+step:50800/500000 train_loss:1.9725 train_time:2220324ms step_avg:43.71ms
+step:51000/500000 train_loss:1.9960 train_time:2229067ms step_avg:43.71ms
+step:51200/500000 train_loss:2.0345 train_time:2237807ms step_avg:43.71ms
+step:51400/500000 train_loss:2.0990 train_time:2246550ms step_avg:43.71ms
+step:51600/500000 train_loss:2.0457 train_time:2255285ms step_avg:43.71ms
+step:51800/500000 train_loss:2.0500 train_time:2264020ms step_avg:43.71ms
+step:52000/500000 train_loss:1.9877 train_time:2272760ms step_avg:43.71ms
+step:52200/500000 train_loss:1.9642 train_time:2281497ms step_avg:43.71ms
+step:52400/500000 train_loss:2.0589 train_time:2290236ms step_avg:43.71ms
+step:52600/500000 train_loss:2.0642 train_time:2299090ms step_avg:43.71ms
+step:52800/500000 train_loss:2.0002 train_time:2307829ms step_avg:43.71ms
+step:53000/500000 train_loss:2.1251 train_time:2316559ms step_avg:43.71ms
+step:53200/500000 train_loss:2.2760 train_time:2325293ms step_avg:43.71ms
+step:53400/500000 train_loss:1.9892 train_time:2334032ms step_avg:43.71ms
+step:53600/500000 train_loss:2.0399 train_time:2342765ms step_avg:43.71ms
+step:53800/500000 train_loss:2.0171 train_time:2351497ms step_avg:43.71ms
+step:54000/500000 train_loss:1.9835 train_time:2360222ms step_avg:43.71ms
+step:54200/500000 train_loss:1.9996 train_time:2368976ms step_avg:43.71ms
+step:54400/500000 train_loss:2.1281 train_time:2377711ms step_avg:43.71ms
+step:54600/500000 train_loss:2.0236 train_time:2386444ms step_avg:43.71ms
+step:54800/500000 train_loss:1.9770 train_time:2395176ms step_avg:43.71ms
+step:55000/500000 train_loss:2.1209 train_time:2403912ms step_avg:43.71ms
+step:55200/500000 train_loss:2.0261 train_time:2412637ms step_avg:43.71ms
+step:55400/500000 train_loss:2.0456 train_time:2421371ms step_avg:43.71ms
+step:55600/500000 train_loss:2.0821 train_time:2430109ms step_avg:43.71ms
+step:55800/500000 train_loss:1.9503 train_time:2438834ms step_avg:43.71ms
+step:56000/500000 train_loss:2.0074 train_time:2447565ms step_avg:43.71ms
+step:56200/500000 train_loss:2.0663 train_time:2456301ms step_avg:43.71ms
+step:56400/500000 train_loss:2.0383 train_time:2465029ms step_avg:43.71ms
+step:56600/500000 train_loss:2.0852 train_time:2473880ms step_avg:43.71ms
+step:56800/500000 train_loss:1.9910 train_time:2482615ms step_avg:43.71ms
+step:57000/500000 train_loss:2.0997 train_time:2491342ms step_avg:43.71ms
+step:57200/500000 train_loss:2.0564 train_time:2500070ms step_avg:43.71ms
+step:57400/500000 train_loss:1.9974 train_time:2508803ms step_avg:43.71ms
+step:57600/500000 train_loss:2.0217 train_time:2517529ms step_avg:43.71ms
+step:57800/500000 train_loss:1.9696 train_time:2526265ms step_avg:43.71ms
+step:58000/500000 train_loss:2.0311 train_time:2534998ms step_avg:43.71ms
+step:58200/500000 train_loss:2.0664 train_time:2543727ms step_avg:43.71ms
+step:58400/500000 train_loss:2.0663 train_time:2552468ms step_avg:43.71ms
+step:58600/500000 train_loss:1.8755 train_time:2561203ms step_avg:43.71ms
+step:58800/500000 train_loss:2.0705 train_time:2569943ms step_avg:43.71ms
+step:59000/500000 train_loss:2.1983 train_time:2578691ms step_avg:43.71ms
+step:59200/500000 train_loss:2.0254 train_time:2587421ms step_avg:43.71ms
+step:59400/500000 train_loss:1.9081 train_time:2596157ms step_avg:43.71ms
+step:59600/500000 train_loss:1.8892 train_time:2604900ms step_avg:43.71ms
+step:59800/500000 train_loss:2.0723 train_time:2613631ms step_avg:43.71ms
+step:60000/500000 train_loss:1.9396 train_time:2622368ms step_avg:43.71ms
+step:60000/500000 val_loss:2.0377 val_bpb:1.2069 train_time:2622385ms step_avg:43.71ms
+step:60200/500000 train_loss:2.0829 train_time:2631107ms step_avg:43.71ms
+step:60400/500000 train_loss:2.0247 train_time:2639847ms step_avg:43.71ms
+step:60600/500000 train_loss:2.0155 train_time:2648581ms step_avg:43.71ms
+step:60800/500000 train_loss:1.9590 train_time:2657437ms step_avg:43.71ms
+step:61000/500000 train_loss:2.0764 train_time:2666173ms step_avg:43.71ms
+step:61200/500000 train_loss:2.0025 train_time:2674908ms step_avg:43.71ms
+step:61400/500000 train_loss:2.0852 train_time:2683660ms step_avg:43.71ms
+step:61600/500000 train_loss:2.1964 train_time:2692401ms step_avg:43.71ms
+step:61800/500000 train_loss:2.1862 train_time:2701152ms step_avg:43.71ms
+step:62000/500000 train_loss:2.0954 train_time:2709889ms step_avg:43.71ms
+step:62200/500000 train_loss:2.0075 train_time:2718626ms step_avg:43.71ms
+step:62400/500000 train_loss:1.9852 train_time:2727373ms step_avg:43.71ms
+step:62600/500000 train_loss:2.0035 train_time:2736123ms step_avg:43.71ms
+step:62800/500000 train_loss:2.1441 train_time:2744854ms step_avg:43.71ms
+step:63000/500000 train_loss:1.8316 train_time:2753594ms step_avg:43.71ms
+step:63200/500000 train_loss:2.0102 train_time:2762341ms step_avg:43.71ms
+step:63400/500000 train_loss:1.8111 train_time:2771087ms step_avg:43.71ms
+step:63600/500000 train_loss:2.0566 train_time:2779840ms step_avg:43.71ms
+step:63800/500000 train_loss:2.0424 train_time:2788589ms step_avg:43.71ms
+step:64000/500000 train_loss:1.9626 train_time:2797329ms step_avg:43.71ms
+step:64200/500000 train_loss:2.0900 train_time:2806076ms step_avg:43.71ms
+step:64400/500000 train_loss:2.0560 train_time:2814809ms step_avg:43.71ms
+step:64600/500000 train_loss:1.9803 train_time:2823559ms step_avg:43.71ms
+step:64800/500000 train_loss:2.0888 train_time:2832415ms step_avg:43.71ms
+step:65000/500000 train_loss:1.9832 train_time:2841158ms step_avg:43.71ms
+step:65200/500000 train_loss:2.0433 train_time:2849898ms step_avg:43.71ms
+step:65400/500000 train_loss:1.8724 train_time:2858640ms step_avg:43.71ms
+step:65600/500000 train_loss:1.9251 train_time:2867381ms step_avg:43.71ms
+step:65800/500000 train_loss:2.0526 train_time:2876123ms step_avg:43.71ms
+step:66000/500000 train_loss:1.9819 train_time:2884855ms step_avg:43.71ms
+step:66200/500000 train_loss:2.1081 train_time:2893587ms step_avg:43.71ms
+step:66400/500000 train_loss:2.0156 train_time:2902325ms step_avg:43.71ms
+step:66600/500000 train_loss:2.0231 train_time:2911066ms step_avg:43.71ms
+step:66800/500000 train_loss:1.9735 train_time:2919799ms step_avg:43.71ms
+step:67000/500000 train_loss:1.9554 train_time:2928542ms step_avg:43.71ms
+step:67200/500000 train_loss:2.0098 train_time:2937282ms step_avg:43.71ms
+step:67400/500000 train_loss:2.1122 train_time:2946019ms step_avg:43.71ms
+step:67600/500000 train_loss:2.0330 train_time:2954754ms step_avg:43.71ms
+step:67800/500000 train_loss:1.9299 train_time:2963497ms step_avg:43.71ms
+step:68000/500000 train_loss:2.0881 train_time:2972236ms step_avg:43.71ms
+step:68200/500000 train_loss:1.9947 train_time:2980982ms step_avg:43.71ms
+step:68400/500000 train_loss:2.1097 train_time:2989720ms step_avg:43.71ms
+step:68600/500000 train_loss:2.1249 train_time:2998464ms step_avg:43.71ms
+step:68800/500000 train_loss:1.9755 train_time:3007204ms step_avg:43.71ms
+step:69000/500000 train_loss:1.9303 train_time:3016067ms step_avg:43.71ms
+step:69200/500000 train_loss:2.0941 train_time:3024806ms step_avg:43.71ms
+step:69400/500000 train_loss:2.0425 train_time:3033540ms step_avg:43.71ms
+step:69600/500000 train_loss:1.9208 train_time:3042275ms step_avg:43.71ms
+step:69800/500000 train_loss:2.0736 train_time:3051002ms step_avg:43.71ms
+step:70000/500000 train_loss:2.0506 train_time:3059740ms step_avg:43.71ms
+step:70200/500000 train_loss:1.9177 train_time:3068480ms step_avg:43.71ms
+step:70400/500000 train_loss:2.0552 train_time:3077209ms step_avg:43.71ms
+step:70600/500000 train_loss:2.1593 train_time:3085950ms step_avg:43.71ms
+step:70800/500000 train_loss:2.3957 train_time:3094684ms step_avg:43.71ms
+step:71000/500000 train_loss:2.1222 train_time:3103419ms step_avg:43.71ms
+step:71200/500000 train_loss:2.1032 train_time:3112149ms step_avg:43.71ms
+step:71400/500000 train_loss:1.9591 train_time:3120894ms step_avg:43.71ms
+step:71600/500000 train_loss:1.9592 train_time:3129636ms step_avg:43.71ms
+step:71800/500000 train_loss:2.0074 train_time:3138374ms step_avg:43.71ms
+step:72000/500000 train_loss:2.2693 train_time:3147117ms step_avg:43.71ms
+step:72200/500000 train_loss:2.1045 train_time:3155857ms step_avg:43.71ms
+step:72400/500000 train_loss:1.8356 train_time:3164598ms step_avg:43.71ms
+step:72600/500000 train_loss:2.0390 train_time:3173344ms step_avg:43.71ms
+step:72800/500000 train_loss:1.9867 train_time:3182090ms step_avg:43.71ms
+step:73000/500000 train_loss:2.0488 train_time:3190957ms step_avg:43.71ms
+step:73200/500000 train_loss:2.0355 train_time:3199692ms step_avg:43.71ms
+step:73400/500000 train_loss:1.9839 train_time:3208438ms step_avg:43.71ms
+step:73600/500000 train_loss:2.0317 train_time:3217184ms step_avg:43.71ms
+step:73800/500000 train_loss:1.9925 train_time:3225921ms step_avg:43.71ms
+step:74000/500000 train_loss:1.9522 train_time:3234658ms step_avg:43.71ms
+step:74200/500000 train_loss:2.1049 train_time:3243364ms step_avg:43.71ms
+step:74400/500000 train_loss:1.9785 train_time:3252099ms step_avg:43.71ms
+step:74600/500000 train_loss:2.0592 train_time:3260837ms step_avg:43.71ms
+step:74800/500000 train_loss:2.0489 train_time:3269587ms step_avg:43.71ms
+step:75000/500000 train_loss:2.0631 train_time:3278321ms step_avg:43.71ms
+step:75200/500000 train_loss:2.0235 train_time:3287054ms step_avg:43.71ms
+step:75400/500000 train_loss:1.8544 train_time:3295787ms step_avg:43.71ms
+step:75600/500000 train_loss:1.8574 train_time:3304514ms step_avg:43.71ms
+step:75800/500000 train_loss:2.0504 train_time:3313237ms step_avg:43.71ms
+step:76000/500000 train_loss:1.9967 train_time:3322083ms step_avg:43.71ms
+step:76200/500000 train_loss:2.0440 train_time:3330827ms step_avg:43.71ms
+step:76400/500000 train_loss:1.9953 train_time:3339574ms step_avg:43.71ms
+step:76600/500000 train_loss:2.0072 train_time:3348313ms step_avg:43.71ms
+step:76800/500000 train_loss:2.0488 train_time:3357061ms step_avg:43.71ms
+step:77000/500000 train_loss:2.0404 train_time:3365811ms step_avg:43.71ms
+step:77200/500000 train_loss:2.0290 train_time:3374553ms step_avg:43.71ms
+step:77400/500000 train_loss:1.8698 train_time:3383295ms step_avg:43.71ms
+step:77600/500000 train_loss:1.9732 train_time:3392031ms step_avg:43.71ms
+step:77800/500000 train_loss:2.0140 train_time:3400774ms step_avg:43.71ms
+step:78000/500000 train_loss:1.9584 train_time:3409520ms step_avg:43.71ms
+step:78200/500000 train_loss:2.2657 train_time:3418280ms step_avg:43.71ms
+step:78400/500000 train_loss:2.0041 train_time:3427037ms step_avg:43.71ms
+step:78600/500000 train_loss:2.0953 train_time:3435793ms step_avg:43.71ms
+step:78800/500000 train_loss:2.0426 train_time:3444542ms step_avg:43.71ms
+step:79000/500000 train_loss:2.0681 train_time:3453286ms step_avg:43.71ms
+step:79200/500000 train_loss:2.0154 train_time:3462037ms step_avg:43.71ms
+step:79400/500000 train_loss:1.9769 train_time:3470781ms step_avg:43.71ms
+step:79600/500000 train_loss:2.0938 train_time:3479530ms step_avg:43.71ms
+step:79800/500000 train_loss:2.0043 train_time:3488267ms step_avg:43.71ms
+step:80000/500000 train_loss:2.0123 train_time:3497014ms step_avg:43.71ms
+step:80000/500000 val_loss:2.0306 val_bpb:1.2026 train_time:3497030ms step_avg:43.71ms
+step:80200/500000 train_loss:1.9839 train_time:3505891ms step_avg:43.71ms
+step:80400/500000 train_loss:2.0201 train_time:3514643ms step_avg:43.71ms
+step:80600/500000 train_loss:1.9697 train_time:3523393ms step_avg:43.71ms
+step:80800/500000 train_loss:1.8849 train_time:3532145ms step_avg:43.71ms
+step:81000/500000 train_loss:2.0080 train_time:3540891ms step_avg:43.71ms
+step:81200/500000 train_loss:2.0728 train_time:3549647ms step_avg:43.71ms
+step:81400/500000 train_loss:2.1014 train_time:3558381ms step_avg:43.71ms
+step:81600/500000 train_loss:2.0568 train_time:3567112ms step_avg:43.71ms
+step:81800/500000 train_loss:2.0455 train_time:3575852ms step_avg:43.71ms
+step:82000/500000 train_loss:2.0547 train_time:3584589ms step_avg:43.71ms
+step:82200/500000 train_loss:2.0608 train_time:3593331ms step_avg:43.71ms
+step:82400/500000 train_loss:2.3063 train_time:3602075ms step_avg:43.71ms
+step:82600/500000 train_loss:2.0339 train_time:3610813ms step_avg:43.71ms
+step:82800/500000 train_loss:2.1101 train_time:3619548ms step_avg:43.71ms
+step:83000/500000 train_loss:1.9921 train_time:3628292ms step_avg:43.71ms
+step:83200/500000 train_loss:1.9948 train_time:3637032ms step_avg:43.71ms
+step:83400/500000 train_loss:2.0244 train_time:3645763ms step_avg:43.71ms
+step:83600/500000 train_loss:1.9101 train_time:3654495ms step_avg:43.71ms
+step:83800/500000 train_loss:2.0107 train_time:3663231ms step_avg:43.71ms
+step:84000/500000 train_loss:1.9378 train_time:3671965ms step_avg:43.71ms
+step:84200/500000 train_loss:2.0769 train_time:3680704ms step_avg:43.71ms
+step:84400/500000 train_loss:1.9761 train_time:3689562ms step_avg:43.72ms
+step:84600/500000 train_loss:2.0646 train_time:3698295ms step_avg:43.72ms
+step:84800/500000 train_loss:2.0266 train_time:3707034ms step_avg:43.72ms
+step:85000/500000 train_loss:2.0077 train_time:3715778ms step_avg:43.72ms
+step:85200/500000 train_loss:2.1978 train_time:3724509ms step_avg:43.71ms
+step:85400/500000 train_loss:2.1181 train_time:3733240ms step_avg:43.71ms
+step:85600/500000 train_loss:1.9731 train_time:3741988ms step_avg:43.71ms
+step:85800/500000 train_loss:1.9917 train_time:3750739ms step_avg:43.71ms
+step:86000/500000 train_loss:1.9015 train_time:3759481ms step_avg:43.71ms
+step:86200/500000 train_loss:1.9947 train_time:3768217ms step_avg:43.71ms
+step:86400/500000 train_loss:2.0555 train_time:3776958ms step_avg:43.71ms
+step:86600/500000 train_loss:1.9431 train_time:3785705ms step_avg:43.71ms
+step:86800/500000 train_loss:1.9248 train_time:3794444ms step_avg:43.71ms
+step:87000/500000 train_loss:2.2830 train_time:3803191ms step_avg:43.71ms
+step:87200/500000 train_loss:1.9020 train_time:3811936ms step_avg:43.71ms
+step:87400/500000 train_loss:2.0276 train_time:3820677ms step_avg:43.71ms
+step:87600/500000 train_loss:1.9286 train_time:3829421ms step_avg:43.71ms
+step:87800/500000 train_loss:2.0150 train_time:3838163ms step_avg:43.71ms
+step:88000/500000 train_loss:2.1064 train_time:3846914ms step_avg:43.71ms
+step:88200/500000 train_loss:1.8532 train_time:3855651ms step_avg:43.71ms
+step:88400/500000 train_loss:2.1184 train_time:3864514ms step_avg:43.72ms
+step:88600/500000 train_loss:1.9995 train_time:3873261ms step_avg:43.72ms
+step:88800/500000 train_loss:2.0633 train_time:3882020ms step_avg:43.72ms
+step:89000/500000 train_loss:2.0382 train_time:3890758ms step_avg:43.72ms
+step:89200/500000 train_loss:1.9755 train_time:3899507ms step_avg:43.72ms
+step:89400/500000 train_loss:1.9484 train_time:3908254ms step_avg:43.72ms
+step:89600/500000 train_loss:2.0261 train_time:3916999ms step_avg:43.72ms
+step:89800/500000 train_loss:1.9487 train_time:3925742ms step_avg:43.72ms
+step:90000/500000 train_loss:2.0480 train_time:3934482ms step_avg:43.72ms
+step:90200/500000 train_loss:2.1049 train_time:3943216ms step_avg:43.72ms
+step:90400/500000 train_loss:2.0033 train_time:3951955ms step_avg:43.72ms
+step:90600/500000 train_loss:2.1195 train_time:3960696ms step_avg:43.72ms
+step:90800/500000 train_loss:2.0872 train_time:3969435ms step_avg:43.72ms
+step:91000/500000 train_loss:2.0414 train_time:3978176ms step_avg:43.72ms
+step:91200/500000 train_loss:1.9898 train_time:3986934ms step_avg:43.72ms
+step:91400/500000 train_loss:2.1097 train_time:3995676ms step_avg:43.72ms
+step:91600/500000 train_loss:2.0608 train_time:4004427ms step_avg:43.72ms
+step:91800/500000 train_loss:2.0089 train_time:4013164ms step_avg:43.72ms
+step:92000/500000 train_loss:2.0163 train_time:4021901ms step_avg:43.72ms
+step:92200/500000 train_loss:2.0968 train_time:4030647ms step_avg:43.72ms
+step:92400/500000 train_loss:1.9743 train_time:4039391ms step_avg:43.72ms
+step:92600/500000 train_loss:2.0348 train_time:4048245ms step_avg:43.72ms
+step:92800/500000 train_loss:2.0199 train_time:4056989ms step_avg:43.72ms
+step:93000/500000 train_loss:2.0642 train_time:4065735ms step_avg:43.72ms
+step:93200/500000 train_loss:2.1773 train_time:4074480ms step_avg:43.72ms
+step:93400/500000 train_loss:2.0545 train_time:4083215ms step_avg:43.72ms
+step:93600/500000 train_loss:1.9643 train_time:4091955ms step_avg:43.72ms
+step:93800/500000 train_loss:2.0638 train_time:4100694ms step_avg:43.72ms
+step:94000/500000 train_loss:2.0236 train_time:4109446ms step_avg:43.72ms
+step:94200/500000 train_loss:2.0259 train_time:4118191ms step_avg:43.72ms
+step:94400/500000 train_loss:1.9855 train_time:4126935ms step_avg:43.72ms
+step:94600/500000 train_loss:2.0248 train_time:4135681ms step_avg:43.72ms
+step:94800/500000 train_loss:2.0139 train_time:4144419ms step_avg:43.72ms
+step:95000/500000 train_loss:2.1720 train_time:4153155ms step_avg:43.72ms
+step:95200/500000 train_loss:2.0017 train_time:4161900ms step_avg:43.72ms
+step:95400/500000 train_loss:2.0994 train_time:4170642ms step_avg:43.72ms
+step:95600/500000 train_loss:2.0451 train_time:4179383ms step_avg:43.72ms
+step:95800/500000 train_loss:1.9698 train_time:4188120ms step_avg:43.72ms
+step:96000/500000 train_loss:2.0817 train_time:4196853ms step_avg:43.72ms
+step:96200/500000 train_loss:2.1018 train_time:4205589ms step_avg:43.72ms
+step:96400/500000 train_loss:2.1385 train_time:4214327ms step_avg:43.72ms
+step:96600/500000 train_loss:2.0407 train_time:4223177ms step_avg:43.72ms
+step:96800/500000 train_loss:1.9939 train_time:4231910ms step_avg:43.72ms
+step:97000/500000 train_loss:2.0416 train_time:4240640ms step_avg:43.72ms
+step:97200/500000 train_loss:2.0274 train_time:4249373ms step_avg:43.72ms
+step:97400/500000 train_loss:1.8758 train_time:4258108ms step_avg:43.72ms
+step:97600/500000 train_loss:1.9845 train_time:4266837ms step_avg:43.72ms
+step:97800/500000 train_loss:2.1325 train_time:4275574ms step_avg:43.72ms
+step:98000/500000 train_loss:1.9135 train_time:4284317ms step_avg:43.72ms
+step:98200/500000 train_loss:2.1159 train_time:4293049ms step_avg:43.72ms
+step:98400/500000 train_loss:1.9857 train_time:4301781ms step_avg:43.72ms
+step:98600/500000 train_loss:1.9655 train_time:4310513ms step_avg:43.72ms
+step:98800/500000 train_loss:1.9529 train_time:4319254ms step_avg:43.72ms
+step:99000/500000 train_loss:2.0475 train_time:4327994ms step_avg:43.72ms
+step:99200/500000 train_loss:2.0054 train_time:4336730ms step_avg:43.72ms
+step:99400/500000 train_loss:1.9931 train_time:4345463ms step_avg:43.72ms
+step:99600/500000 train_loss:1.9279 train_time:4354197ms step_avg:43.72ms
+step:99800/500000 train_loss:1.9186 train_time:4362935ms step_avg:43.72ms
+step:100000/500000 train_loss:1.7548 train_time:4371686ms step_avg:43.72ms
+step:100000/500000 val_loss:2.0253 val_bpb:1.1995 train_time:4371696ms step_avg:43.72ms
+step:100200/500000 train_loss:1.9653 train_time:4380428ms step_avg:43.72ms
+step:100400/500000 train_loss:2.1490 train_time:4389172ms step_avg:43.72ms
+step:100600/500000 train_loss:2.1338 train_time:4397923ms step_avg:43.72ms
+step:100800/500000 train_loss:1.9365 train_time:4406793ms step_avg:43.72ms
+step:101000/500000 train_loss:2.1717 train_time:4415535ms step_avg:43.72ms
+step:101200/500000 train_loss:2.0397 train_time:4424281ms step_avg:43.72ms
+step:101400/500000 train_loss:2.0684 train_time:4433025ms step_avg:43.72ms
+step:101600/500000 train_loss:2.0624 train_time:4441766ms step_avg:43.72ms
+step:101800/500000 train_loss:1.9881 train_time:4450520ms step_avg:43.72ms
+step:102000/500000 train_loss:2.0370 train_time:4459258ms step_avg:43.72ms
+step:102200/500000 train_loss:2.0265 train_time:4468004ms step_avg:43.72ms
+step:102400/500000 train_loss:2.0059 train_time:4476745ms step_avg:43.72ms
+step:102600/500000 train_loss:2.1766 train_time:4485487ms step_avg:43.72ms
+step:102800/500000 train_loss:1.9531 train_time:4494235ms step_avg:43.72ms
+step:103000/500000 train_loss:2.0770 train_time:4502984ms step_avg:43.72ms
+step:103200/500000 train_loss:1.9543 train_time:4511735ms step_avg:43.72ms
+step:103400/500000 train_loss:2.0411 train_time:4520487ms step_avg:43.72ms
+step:103600/500000 train_loss:2.0500 train_time:4529227ms step_avg:43.72ms
+step:103800/500000 train_loss:1.9965 train_time:4537967ms step_avg:43.72ms
+step:104000/500000 train_loss:2.0624 train_time:4546714ms step_avg:43.72ms
+step:104200/500000 train_loss:1.9853 train_time:4555461ms step_avg:43.72ms
+step:104400/500000 train_loss:1.9747 train_time:4564204ms step_avg:43.72ms
+step:104600/500000 train_loss:2.0848 train_time:4572941ms step_avg:43.72ms
+step:104800/500000 train_loss:2.0619 train_time:4581686ms step_avg:43.72ms
+step:105000/500000 train_loss:2.0278 train_time:4590551ms step_avg:43.72ms
+step:105200/500000 train_loss:1.9722 train_time:4599291ms step_avg:43.72ms
+step:105400/500000 train_loss:2.1234 train_time:4608037ms step_avg:43.72ms
+step:105600/500000 train_loss:1.9276 train_time:4616777ms step_avg:43.72ms
+step:105800/500000 train_loss:2.1207 train_time:4625512ms step_avg:43.72ms
+step:106000/500000 train_loss:1.9724 train_time:4634245ms step_avg:43.72ms
+step:106200/500000 train_loss:1.8480 train_time:4642984ms step_avg:43.72ms
+step:106400/500000 train_loss:1.9962 train_time:4651724ms step_avg:43.72ms
+step:106600/500000 train_loss:2.0682 train_time:4660463ms step_avg:43.72ms
+step:106800/500000 train_loss:1.9596 train_time:4669202ms step_avg:43.72ms
+step:107000/500000 train_loss:1.9887 train_time:4677945ms step_avg:43.72ms
+step:107200/500000 train_loss:2.0878 train_time:4686687ms step_avg:43.72ms
+step:107400/500000 train_loss:1.9820 train_time:4695424ms step_avg:43.72ms
+step:107600/500000 train_loss:1.8612 train_time:4704155ms step_avg:43.72ms
+step:107800/500000 train_loss:2.1396 train_time:4712892ms step_avg:43.72ms
+step:108000/500000 train_loss:2.0850 train_time:4721621ms step_avg:43.72ms
+step:108200/500000 train_loss:2.0261 train_time:4730365ms step_avg:43.72ms
+step:108400/500000 train_loss:2.0620 train_time:4739111ms step_avg:43.72ms
+step:108600/500000 train_loss:1.9826 train_time:4747853ms step_avg:43.72ms
+step:108800/500000 train_loss:2.1798 train_time:4756598ms step_avg:43.72ms
+step:109000/500000 train_loss:2.0688 train_time:4765478ms step_avg:43.72ms
+step:109200/500000 train_loss:2.1944 train_time:4774228ms step_avg:43.72ms
+step:109400/500000 train_loss:2.0273 train_time:4782964ms step_avg:43.72ms
+step:109600/500000 train_loss:1.9818 train_time:4791698ms step_avg:43.72ms
+step:109800/500000 train_loss:1.8197 train_time:4800433ms step_avg:43.72ms
+step:110000/500000 train_loss:1.9632 train_time:4809167ms step_avg:43.72ms
+step:110200/500000 train_loss:2.0929 train_time:4817911ms step_avg:43.72ms
+step:110400/500000 train_loss:1.9924 train_time:4826648ms step_avg:43.72ms
+step:110600/500000 train_loss:1.9691 train_time:4835381ms step_avg:43.72ms
+step:110800/500000 train_loss:2.0186 train_time:4844110ms step_avg:43.72ms
+step:111000/500000 train_loss:2.0252 train_time:4852847ms step_avg:43.72ms
+step:111200/500000 train_loss:2.2204 train_time:4861590ms step_avg:43.72ms
+step:111400/500000 train_loss:2.0328 train_time:4870299ms step_avg:43.72ms
+step:111600/500000 train_loss:2.0633 train_time:4879034ms step_avg:43.72ms
+step:111800/500000 train_loss:2.0296 train_time:4887776ms step_avg:43.72ms
+step:112000/500000 train_loss:2.0067 train_time:4896638ms step_avg:43.72ms
+step:112200/500000 train_loss:2.0360 train_time:4905378ms step_avg:43.72ms
+step:112400/500000 train_loss:1.8986 train_time:4914115ms step_avg:43.72ms
+step:112600/500000 train_loss:1.8944 train_time:4922857ms step_avg:43.72ms
+step:112800/500000 train_loss:1.9784 train_time:4931604ms step_avg:43.72ms
+step:113000/500000 train_loss:2.0355 train_time:4940342ms step_avg:43.72ms
+step:113200/500000 train_loss:1.8432 train_time:4949080ms step_avg:43.72ms
+step:113400/500000 train_loss:2.2008 train_time:4957827ms step_avg:43.72ms
+step:113600/500000 train_loss:1.9759 train_time:4966563ms step_avg:43.72ms
+step:113800/500000 train_loss:1.8877 train_time:4975306ms step_avg:43.72ms
+step:114000/500000 train_loss:2.0328 train_time:4984057ms step_avg:43.72ms
+step:114200/500000 train_loss:1.9748 train_time:4992790ms step_avg:43.72ms
+step:114400/500000 train_loss:1.9246 train_time:5001527ms step_avg:43.72ms
+step:114600/500000 train_loss:2.1953 train_time:5010259ms step_avg:43.72ms
+step:114800/500000 train_loss:1.9761 train_time:5018992ms step_avg:43.72ms
+step:115000/500000 train_loss:1.9908 train_time:5027726ms step_avg:43.72ms
+step:115200/500000 train_loss:2.0108 train_time:5036461ms step_avg:43.72ms
+step:115400/500000 train_loss:2.0292 train_time:5045199ms step_avg:43.72ms
+step:115600/500000 train_loss:2.0888 train_time:5053924ms step_avg:43.72ms
+step:115800/500000 train_loss:2.2096 train_time:5062667ms step_avg:43.72ms
+step:116000/500000 train_loss:1.9721 train_time:5071397ms step_avg:43.72ms
+step:116200/500000 train_loss:2.0199 train_time:5080251ms step_avg:43.72ms
+step:116400/500000 train_loss:2.1796 train_time:5088980ms step_avg:43.72ms
+step:116600/500000 train_loss:1.9663 train_time:5097712ms step_avg:43.72ms
+step:116800/500000 train_loss:2.0318 train_time:5106443ms step_avg:43.72ms
+step:117000/500000 train_loss:1.9711 train_time:5115180ms step_avg:43.72ms
+step:117200/500000 train_loss:1.7761 train_time:5123922ms step_avg:43.72ms
+step:117400/500000 train_loss:2.0077 train_time:5132656ms step_avg:43.72ms
+step:117600/500000 train_loss:1.9814 train_time:5141384ms step_avg:43.72ms
+step:117800/500000 train_loss:2.1267 train_time:5150118ms step_avg:43.72ms
+step:118000/500000 train_loss:1.8956 train_time:5158844ms step_avg:43.72ms
+step:118200/500000 train_loss:1.8866 train_time:5167573ms step_avg:43.72ms
+step:118400/500000 train_loss:2.0781 train_time:5176309ms step_avg:43.72ms
+step:118600/500000 train_loss:2.3821 train_time:5185055ms step_avg:43.72ms
+step:118800/500000 train_loss:2.0305 train_time:5193784ms step_avg:43.72ms
+step:119000/500000 train_loss:1.9897 train_time:5202517ms step_avg:43.72ms
+step:119200/500000 train_loss:2.0434 train_time:5211253ms step_avg:43.72ms
+step:119400/500000 train_loss:2.0924 train_time:5220001ms step_avg:43.72ms
+step:119600/500000 train_loss:2.0761 train_time:5228736ms step_avg:43.72ms
+step:119800/500000 train_loss:2.2817 train_time:5237464ms step_avg:43.72ms
+step:120000/500000 train_loss:2.1493 train_time:5246197ms step_avg:43.72ms
+step:120000/500000 val_loss:2.0210 val_bpb:1.1969 train_time:5246213ms step_avg:43.72ms
+step:120200/500000 train_loss:1.9144 train_time:5255050ms step_avg:43.72ms
+step:120400/500000 train_loss:2.0053 train_time:5263788ms step_avg:43.72ms
+step:120600/500000 train_loss:1.9489 train_time:5272526ms step_avg:43.72ms
+step:120800/500000 train_loss:1.9367 train_time:5281265ms step_avg:43.72ms
+step:121000/500000 train_loss:1.9876 train_time:5289998ms step_avg:43.72ms
+step:121200/500000 train_loss:2.0388 train_time:5298737ms step_avg:43.72ms
+step:121400/500000 train_loss:1.9117 train_time:5307477ms step_avg:43.72ms
+step:121600/500000 train_loss:2.1033 train_time:5316209ms step_avg:43.72ms
+step:121800/500000 train_loss:1.9801 train_time:5324944ms step_avg:43.72ms
+step:122000/500000 train_loss:1.9211 train_time:5333677ms step_avg:43.72ms
+step:122200/500000 train_loss:1.9922 train_time:5342421ms step_avg:43.72ms
+step:122400/500000 train_loss:2.0145 train_time:5351162ms step_avg:43.72ms
+step:122600/500000 train_loss:1.9822 train_time:5359901ms step_avg:43.72ms
+step:122800/500000 train_loss:1.9503 train_time:5368632ms step_avg:43.72ms
+step:123000/500000 train_loss:2.0314 train_time:5377369ms step_avg:43.72ms
+step:123200/500000 train_loss:2.0501 train_time:5386094ms step_avg:43.72ms
+step:123400/500000 train_loss:2.0650 train_time:5394818ms step_avg:43.72ms
+step:123600/500000 train_loss:2.0346 train_time:5403547ms step_avg:43.72ms
+step:123800/500000 train_loss:1.9980 train_time:5412273ms step_avg:43.72ms
+step:124000/500000 train_loss:1.8607 train_time:5421006ms step_avg:43.72ms
+step:124200/500000 train_loss:2.0196 train_time:5429737ms step_avg:43.72ms
+step:124400/500000 train_loss:1.9967 train_time:5438575ms step_avg:43.72ms
+step:124600/500000 train_loss:2.0685 train_time:5447313ms step_avg:43.72ms
+step:124800/500000 train_loss:2.0213 train_time:5456041ms step_avg:43.72ms
+step:125000/500000 train_loss:2.0980 train_time:5464778ms step_avg:43.72ms
+step:125200/500000 train_loss:1.9944 train_time:5473524ms step_avg:43.72ms
+step:125400/500000 train_loss:2.0300 train_time:5482273ms step_avg:43.72ms
+step:125600/500000 train_loss:2.0471 train_time:5491013ms step_avg:43.72ms
+step:125800/500000 train_loss:2.0354 train_time:5499761ms step_avg:43.72ms
+step:126000/500000 train_loss:2.0454 train_time:5508491ms step_avg:43.72ms
+step:126200/500000 train_loss:1.9727 train_time:5517233ms step_avg:43.72ms
+step:126400/500000 train_loss:2.1303 train_time:5525963ms step_avg:43.72ms
+step:126600/500000 train_loss:2.1323 train_time:5534702ms step_avg:43.72ms
+step:126800/500000 train_loss:2.0353 train_time:5543447ms step_avg:43.72ms
+step:127000/500000 train_loss:2.0481 train_time:5552191ms step_avg:43.72ms
+step:127200/500000 train_loss:1.9966 train_time:5560928ms step_avg:43.72ms
+step:127400/500000 train_loss:2.0024 train_time:5569671ms step_avg:43.72ms
+step:127600/500000 train_loss:2.1352 train_time:5578409ms step_avg:43.72ms
+step:127800/500000 train_loss:1.9181 train_time:5587143ms step_avg:43.72ms
+step:128000/500000 train_loss:2.0442 train_time:5595877ms step_avg:43.72ms
+step:128200/500000 train_loss:1.9795 train_time:5604623ms step_avg:43.72ms
+step:128400/500000 train_loss:2.0676 train_time:5613487ms step_avg:43.72ms
+step:128600/500000 train_loss:1.9502 train_time:5622229ms step_avg:43.72ms
+step:128800/500000 train_loss:1.9356 train_time:5630963ms step_avg:43.72ms
+step:129000/500000 train_loss:1.8350 train_time:5639707ms step_avg:43.72ms
+step:129200/500000 train_loss:1.9958 train_time:5648439ms step_avg:43.72ms
+step:129400/500000 train_loss:1.7850 train_time:5657189ms step_avg:43.72ms
+step:129600/500000 train_loss:2.1434 train_time:5665932ms step_avg:43.72ms
+step:129800/500000 train_loss:1.9649 train_time:5674682ms step_avg:43.72ms
+step:130000/500000 train_loss:2.0598 train_time:5683416ms step_avg:43.72ms
+step:130200/500000 train_loss:2.0008 train_time:5692156ms step_avg:43.72ms
+step:130400/500000 train_loss:2.1846 train_time:5700896ms step_avg:43.72ms
+step:130600/500000 train_loss:2.1258 train_time:5709644ms step_avg:43.72ms
+step:130800/500000 train_loss:2.0358 train_time:5718386ms step_avg:43.72ms
+step:131000/500000 train_loss:1.9617 train_time:5727113ms step_avg:43.72ms
+step:131200/500000 train_loss:1.9527 train_time:5735854ms step_avg:43.72ms
+step:131400/500000 train_loss:1.9986 train_time:5744605ms step_avg:43.72ms
+step:131600/500000 train_loss:1.8493 train_time:5753346ms step_avg:43.72ms
+step:131800/500000 train_loss:2.0235 train_time:5762090ms step_avg:43.72ms
+step:132000/500000 train_loss:1.9611 train_time:5770826ms step_avg:43.72ms
+step:132200/500000 train_loss:2.0716 train_time:5779577ms step_avg:43.72ms
+step:132400/500000 train_loss:2.0732 train_time:5788321ms step_avg:43.72ms
+step:132600/500000 train_loss:1.9754 train_time:5797192ms step_avg:43.72ms
+step:132800/500000 train_loss:2.0154 train_time:5805936ms step_avg:43.72ms
+step:133000/500000 train_loss:1.9978 train_time:5814679ms step_avg:43.72ms
+step:133200/500000 train_loss:2.1977 train_time:5823425ms step_avg:43.72ms
+step:133400/500000 train_loss:2.0363 train_time:5832175ms step_avg:43.72ms
+step:133600/500000 train_loss:1.8937 train_time:5840923ms step_avg:43.72ms
+step:133800/500000 train_loss:1.9849 train_time:5849669ms step_avg:43.72ms
+step:134000/500000 train_loss:2.1667 train_time:5858407ms step_avg:43.72ms
+step:134200/500000 train_loss:2.1546 train_time:5867169ms step_avg:43.72ms
+step:134400/500000 train_loss:2.1969 train_time:5875915ms step_avg:43.72ms
+step:134600/500000 train_loss:1.9611 train_time:5884654ms step_avg:43.72ms
+step:134800/500000 train_loss:1.9517 train_time:5893390ms step_avg:43.72ms
+step:135000/500000 train_loss:2.0010 train_time:5902136ms step_avg:43.72ms
+step:135200/500000 train_loss:1.9772 train_time:5910876ms step_avg:43.72ms
+step:135400/500000 train_loss:2.1341 train_time:5919617ms step_avg:43.72ms
+step:135600/500000 train_loss:2.1205 train_time:5928363ms step_avg:43.72ms
+step:135800/500000 train_loss:2.0861 train_time:5937104ms step_avg:43.72ms
+step:136000/500000 train_loss:1.9799 train_time:5945849ms step_avg:43.72ms
+step:136200/500000 train_loss:1.9827 train_time:5954598ms step_avg:43.72ms
+step:136400/500000 train_loss:2.0416 train_time:5963337ms step_avg:43.72ms
+step:136600/500000 train_loss:1.9804 train_time:5972081ms step_avg:43.72ms
+step:136800/500000 train_loss:2.0562 train_time:5980950ms step_avg:43.72ms
+step:137000/500000 train_loss:1.9513 train_time:5989697ms step_avg:43.72ms
+step:137200/500000 train_loss:2.0339 train_time:5998445ms step_avg:43.72ms
+step:137400/500000 train_loss:2.0287 train_time:6007174ms step_avg:43.72ms
+step:137600/500000 train_loss:1.9576 train_time:6015910ms step_avg:43.72ms
+step:137800/500000 train_loss:2.0302 train_time:6024660ms step_avg:43.72ms
+step:138000/500000 train_loss:2.0371 train_time:6033400ms step_avg:43.72ms
+step:138200/500000 train_loss:1.8320 train_time:6042150ms step_avg:43.72ms
+step:138400/500000 train_loss:2.0225 train_time:6050887ms step_avg:43.72ms
+step:138600/500000 train_loss:1.8439 train_time:6059629ms step_avg:43.72ms
+step:138800/500000 train_loss:2.1438 train_time:6068362ms step_avg:43.72ms
+step:139000/500000 train_loss:1.9947 train_time:6077102ms step_avg:43.72ms
+step:139200/500000 train_loss:2.1038 train_time:6085847ms step_avg:43.72ms
+step:139400/500000 train_loss:2.0055 train_time:6094592ms step_avg:43.72ms
+step:139600/500000 train_loss:1.9556 train_time:6103326ms step_avg:43.72ms
+step:139800/500000 train_loss:1.9420 train_time:6112064ms step_avg:43.72ms
+step:140000/500000 train_loss:1.9340 train_time:6120801ms step_avg:43.72ms
+step:140000/500000 val_loss:2.0168 val_bpb:1.1945 train_time:6120818ms step_avg:43.72ms
+step:140200/500000 train_loss:2.0384 train_time:6129542ms step_avg:43.72ms
+step:140400/500000 train_loss:1.9519 train_time:6138277ms step_avg:43.72ms
+step:140600/500000 train_loss:1.9898 train_time:6147017ms step_avg:43.72ms
+step:140800/500000 train_loss:2.2550 train_time:6155876ms step_avg:43.72ms
+step:141000/500000 train_loss:1.9867 train_time:6164617ms step_avg:43.72ms
+step:141200/500000 train_loss:2.0205 train_time:6173359ms step_avg:43.72ms
+step:141400/500000 train_loss:1.8807 train_time:6182099ms step_avg:43.72ms
+step:141600/500000 train_loss:1.9898 train_time:6190842ms step_avg:43.72ms
+step:141800/500000 train_loss:1.8427 train_time:6199583ms step_avg:43.72ms
+step:142000/500000 train_loss:2.1160 train_time:6208323ms step_avg:43.72ms
+step:142200/500000 train_loss:1.8754 train_time:6217067ms step_avg:43.72ms
+step:142400/500000 train_loss:2.0096 train_time:6225813ms step_avg:43.72ms
+step:142600/500000 train_loss:1.9781 train_time:6234549ms step_avg:43.72ms
+step:142800/500000 train_loss:2.0586 train_time:6243289ms step_avg:43.72ms
+step:143000/500000 train_loss:2.1796 train_time:6252012ms step_avg:43.72ms
+step:143200/500000 train_loss:2.0443 train_time:6260744ms step_avg:43.72ms
+step:143400/500000 train_loss:2.0865 train_time:6269501ms step_avg:43.72ms
+step:143600/500000 train_loss:2.0975 train_time:6278235ms step_avg:43.72ms
+step:143800/500000 train_loss:2.0190 train_time:6286972ms step_avg:43.72ms
+step:144000/500000 train_loss:2.0697 train_time:6295708ms step_avg:43.72ms
+step:144200/500000 train_loss:1.9661 train_time:6304436ms step_avg:43.72ms
+step:144400/500000 train_loss:2.0613 train_time:6313163ms step_avg:43.72ms
+step:144600/500000 train_loss:2.2161 train_time:6321898ms step_avg:43.72ms
+step:144800/500000 train_loss:2.0155 train_time:6330626ms step_avg:43.72ms
+step:145000/500000 train_loss:2.0206 train_time:6339481ms step_avg:43.72ms
+step:145200/500000 train_loss:2.0652 train_time:6348218ms step_avg:43.72ms
+step:145400/500000 train_loss:2.0265 train_time:6356943ms step_avg:43.72ms
+step:145600/500000 train_loss:2.0159 train_time:6365669ms step_avg:43.72ms
+step:145800/500000 train_loss:1.8650 train_time:6374392ms step_avg:43.72ms
+step:146000/500000 train_loss:2.0364 train_time:6383124ms step_avg:43.72ms
+step:146200/500000 train_loss:2.0379 train_time:6391852ms step_avg:43.72ms
+step:146400/500000 train_loss:1.9428 train_time:6400597ms step_avg:43.72ms
+step:146600/500000 train_loss:1.9223 train_time:6409330ms step_avg:43.72ms
+step:146800/500000 train_loss:2.0064 train_time:6418075ms step_avg:43.72ms
+step:147000/500000 train_loss:2.0725 train_time:6426812ms step_avg:43.72ms
+step:147200/500000 train_loss:1.9570 train_time:6435542ms step_avg:43.72ms
+step:147400/500000 train_loss:2.0533 train_time:6444273ms step_avg:43.72ms
+step:147600/500000 train_loss:2.0055 train_time:6453012ms step_avg:43.72ms
+step:147800/500000 train_loss:2.0093 train_time:6461744ms step_avg:43.72ms
+step:148000/500000 train_loss:1.9966 train_time:6470481ms step_avg:43.72ms
+step:148200/500000 train_loss:1.9797 train_time:6479223ms step_avg:43.72ms
+step:148400/500000 train_loss:2.0464 train_time:6487966ms step_avg:43.72ms
+step:148600/500000 train_loss:1.9383 train_time:6496787ms step_avg:43.72ms
+step:148800/500000 train_loss:2.1146 train_time:6505531ms step_avg:43.72ms
+step:149000/500000 train_loss:2.0930 train_time:6514273ms step_avg:43.72ms
+step:149200/500000 train_loss:2.0785 train_time:6523008ms step_avg:43.72ms
+step:149400/500000 train_loss:2.0231 train_time:6531742ms step_avg:43.72ms
+step:149600/500000 train_loss:2.2172 train_time:6540477ms step_avg:43.72ms
+step:149800/500000 train_loss:2.0192 train_time:6549214ms step_avg:43.72ms
+step:150000/500000 train_loss:2.0543 train_time:6557949ms step_avg:43.72ms
+step:150200/500000 train_loss:1.9361 train_time:6566691ms step_avg:43.72ms
+step:150400/500000 train_loss:2.0412 train_time:6575420ms step_avg:43.72ms
+step:150600/500000 train_loss:1.9852 train_time:6584160ms step_avg:43.72ms
+step:150800/500000 train_loss:2.0049 train_time:6592900ms step_avg:43.72ms
+step:151000/500000 train_loss:2.0040 train_time:6601638ms step_avg:43.72ms
+step:151200/500000 train_loss:2.1101 train_time:6610372ms step_avg:43.72ms
+step:151400/500000 train_loss:2.0910 train_time:6619105ms step_avg:43.72ms
+step:151600/500000 train_loss:2.0578 train_time:6627844ms step_avg:43.72ms
+step:151800/500000 train_loss:2.0207 train_time:6636581ms step_avg:43.72ms
+step:152000/500000 train_loss:2.0865 train_time:6645440ms step_avg:43.72ms
+step:152200/500000 train_loss:2.0833 train_time:6654166ms step_avg:43.72ms
+step:152400/500000 train_loss:2.1092 train_time:6662895ms step_avg:43.72ms
+step:152600/500000 train_loss:1.9655 train_time:6671624ms step_avg:43.72ms
+step:152800/500000 train_loss:1.9758 train_time:6680366ms step_avg:43.72ms
+step:153000/500000 train_loss:2.2251 train_time:6689098ms step_avg:43.72ms
+step:153200/500000 train_loss:2.0474 train_time:6697837ms step_avg:43.72ms
+step:153400/500000 train_loss:1.9763 train_time:6706572ms step_avg:43.72ms
+step:153600/500000 train_loss:1.8641 train_time:6715312ms step_avg:43.72ms
+step:153800/500000 train_loss:2.0611 train_time:6724045ms step_avg:43.72ms
+step:154000/500000 train_loss:2.0526 train_time:6732775ms step_avg:43.72ms
+step:154200/500000 train_loss:2.1270 train_time:6741515ms step_avg:43.72ms
+step:154400/500000 train_loss:1.9840 train_time:6750256ms step_avg:43.72ms
+step:154600/500000 train_loss:2.0056 train_time:6758997ms step_avg:43.72ms
+step:154800/500000 train_loss:2.0772 train_time:6767752ms step_avg:43.72ms
+step:155000/500000 train_loss:2.1605 train_time:6776484ms step_avg:43.72ms
+step:155200/500000 train_loss:1.9979 train_time:6785227ms step_avg:43.72ms
+step:155400/500000 train_loss:2.0597 train_time:6793955ms step_avg:43.72ms
+step:155600/500000 train_loss:1.9873 train_time:6802697ms step_avg:43.72ms
+step:155800/500000 train_loss:1.9067 train_time:6811437ms step_avg:43.72ms
+step:156000/500000 train_loss:2.0082 train_time:6820174ms step_avg:43.72ms
+step:156200/500000 train_loss:1.8543 train_time:6829025ms step_avg:43.72ms
+step:156400/500000 train_loss:2.0477 train_time:6837757ms step_avg:43.72ms
+step:156600/500000 train_loss:2.0584 train_time:6846488ms step_avg:43.72ms
+step:156800/500000 train_loss:2.0802 train_time:6855230ms step_avg:43.72ms
+step:157000/500000 train_loss:2.0046 train_time:6863969ms step_avg:43.72ms
+step:157200/500000 train_loss:2.0375 train_time:6872709ms step_avg:43.72ms
+step:157400/500000 train_loss:2.1092 train_time:6881444ms step_avg:43.72ms
+step:157600/500000 train_loss:1.9823 train_time:6890182ms step_avg:43.72ms
+step:157800/500000 train_loss:2.0278 train_time:6898919ms step_avg:43.72ms
+step:158000/500000 train_loss:2.0222 train_time:6907648ms step_avg:43.72ms
+step:158200/500000 train_loss:1.9798 train_time:6916376ms step_avg:43.72ms
+step:158400/500000 train_loss:1.9678 train_time:6925111ms step_avg:43.72ms
+step:158600/500000 train_loss:2.0966 train_time:6933843ms step_avg:43.72ms
+step:158800/500000 train_loss:1.9167 train_time:6942577ms step_avg:43.72ms
+step:159000/500000 train_loss:1.9557 train_time:6951310ms step_avg:43.72ms
+step:159200/500000 train_loss:1.9954 train_time:6960046ms step_avg:43.72ms
+step:159400/500000 train_loss:1.9841 train_time:6968776ms step_avg:43.72ms
+step:159600/500000 train_loss:2.0950 train_time:6977509ms step_avg:43.72ms
+step:159800/500000 train_loss:2.0244 train_time:6986242ms step_avg:43.72ms
+step:160000/500000 train_loss:1.9207 train_time:6994982ms step_avg:43.72ms
+step:160000/500000 val_loss:2.0151 val_bpb:1.1935 train_time:6994998ms step_avg:43.72ms
+step:160200/500000 train_loss:2.0018 train_time:7003716ms step_avg:43.72ms
+step:160400/500000 train_loss:2.0475 train_time:7012579ms step_avg:43.72ms
+step:160600/500000 train_loss:1.9469 train_time:7021324ms step_avg:43.72ms
+step:160800/500000 train_loss:1.9277 train_time:7030059ms step_avg:43.72ms
+step:161000/500000 train_loss:1.9321 train_time:7038802ms step_avg:43.72ms
+step:161200/500000 train_loss:1.9661 train_time:7047535ms step_avg:43.72ms
+step:161400/500000 train_loss:2.0205 train_time:7056270ms step_avg:43.72ms
+step:161600/500000 train_loss:1.9691 train_time:7065010ms step_avg:43.72ms
+step:161800/500000 train_loss:1.9580 train_time:7073749ms step_avg:43.72ms
+step:162000/500000 train_loss:2.1239 train_time:7082482ms step_avg:43.72ms
+step:162200/500000 train_loss:2.1069 train_time:7091216ms step_avg:43.72ms
+step:162400/500000 train_loss:1.9907 train_time:7099955ms step_avg:43.72ms
+step:162600/500000 train_loss:2.0223 train_time:7108693ms step_avg:43.72ms
+step:162800/500000 train_loss:1.9429 train_time:7117432ms step_avg:43.72ms
+step:163000/500000 train_loss:2.0619 train_time:7126171ms step_avg:43.72ms
+step:163200/500000 train_loss:1.9096 train_time:7134910ms step_avg:43.72ms
+step:163400/500000 train_loss:1.8932 train_time:7143647ms step_avg:43.72ms
+step:163600/500000 train_loss:2.0096 train_time:7152383ms step_avg:43.72ms
+step:163800/500000 train_loss:1.9822 train_time:7161125ms step_avg:43.72ms
+step:164000/500000 train_loss:1.8641 train_time:7169864ms step_avg:43.72ms
+step:164200/500000 train_loss:2.0342 train_time:7178599ms step_avg:43.72ms
+step:164400/500000 train_loss:1.9387 train_time:7187462ms step_avg:43.72ms
+step:164600/500000 train_loss:2.0189 train_time:7196201ms step_avg:43.72ms
+step:164800/500000 train_loss:1.9743 train_time:7204932ms step_avg:43.72ms
+step:165000/500000 train_loss:2.0910 train_time:7213663ms step_avg:43.72ms
+step:165200/500000 train_loss:2.0131 train_time:7222396ms step_avg:43.72ms
+step:165400/500000 train_loss:1.8053 train_time:7231130ms step_avg:43.72ms
+step:165600/500000 train_loss:2.1171 train_time:7239864ms step_avg:43.72ms
+step:165800/500000 train_loss:2.0570 train_time:7248591ms step_avg:43.72ms
+step:166000/500000 train_loss:2.1153 train_time:7257330ms step_avg:43.72ms
+step:166200/500000 train_loss:2.0100 train_time:7266063ms step_avg:43.72ms
+step:166400/500000 train_loss:1.9842 train_time:7274794ms step_avg:43.72ms
+step:166600/500000 train_loss:2.0044 train_time:7283530ms step_avg:43.72ms
+step:166800/500000 train_loss:1.9264 train_time:7292273ms step_avg:43.72ms
+step:167000/500000 train_loss:2.0284 train_time:7301000ms step_avg:43.72ms
+step:167200/500000 train_loss:2.0714 train_time:7309735ms step_avg:43.72ms
+step:167400/500000 train_loss:1.8683 train_time:7318470ms step_avg:43.72ms
+step:167600/500000 train_loss:2.0481 train_time:7327205ms step_avg:43.72ms
+step:167800/500000 train_loss:2.0047 train_time:7335945ms step_avg:43.72ms
+step:168000/500000 train_loss:2.0068 train_time:7344682ms step_avg:43.72ms
+step:168200/500000 train_loss:1.9203 train_time:7353418ms step_avg:43.72ms
+step:168400/500000 train_loss:1.9711 train_time:7362156ms step_avg:43.72ms
+step:168600/500000 train_loss:2.0499 train_time:7371013ms step_avg:43.72ms
+step:168800/500000 train_loss:2.0193 train_time:7379753ms step_avg:43.72ms
+step:169000/500000 train_loss:1.9500 train_time:7388495ms step_avg:43.72ms
+step:169200/500000 train_loss:1.9424 train_time:7397235ms step_avg:43.72ms
+step:169400/500000 train_loss:2.0937 train_time:7405974ms step_avg:43.72ms
+step:169600/500000 train_loss:2.0015 train_time:7414717ms step_avg:43.72ms
+step:169800/500000 train_loss:1.9852 train_time:7423454ms step_avg:43.72ms
+step:170000/500000 train_loss:1.8737 train_time:7432193ms step_avg:43.72ms
+step:170200/500000 train_loss:2.0264 train_time:7440931ms step_avg:43.72ms
+step:170400/500000 train_loss:2.1094 train_time:7449668ms step_avg:43.72ms
+step:170600/500000 train_loss:1.9665 train_time:7458399ms step_avg:43.72ms
+step:170800/500000 train_loss:1.9582 train_time:7467133ms step_avg:43.72ms
+step:171000/500000 train_loss:1.9925 train_time:7475896ms step_avg:43.72ms
+step:171200/500000 train_loss:2.1872 train_time:7484633ms step_avg:43.72ms
+step:171400/500000 train_loss:2.0731 train_time:7493380ms step_avg:43.72ms
+step:171600/500000 train_loss:1.9710 train_time:7502128ms step_avg:43.72ms
+step:171800/500000 train_loss:1.9976 train_time:7510868ms step_avg:43.72ms
+step:172000/500000 train_loss:1.9109 train_time:7519611ms step_avg:43.72ms
+step:172200/500000 train_loss:1.9541 train_time:7528366ms step_avg:43.72ms
+step:172400/500000 train_loss:2.0850 train_time:7537121ms step_avg:43.72ms
+step:172600/500000 train_loss:1.8652 train_time:7545995ms step_avg:43.72ms
+step:172800/500000 train_loss:1.9614 train_time:7554747ms step_avg:43.72ms
+step:173000/500000 train_loss:2.1891 train_time:7563490ms step_avg:43.72ms
+step:173200/500000 train_loss:1.9574 train_time:7572235ms step_avg:43.72ms
+step:173400/500000 train_loss:2.0368 train_time:7580959ms step_avg:43.72ms
+step:173600/500000 train_loss:2.0617 train_time:7589684ms step_avg:43.72ms
+step:173800/500000 train_loss:2.1104 train_time:7598407ms step_avg:43.72ms
+step:174000/500000 train_loss:2.0506 train_time:7607136ms step_avg:43.72ms
+step:174200/500000 train_loss:1.9299 train_time:7615871ms step_avg:43.72ms
+step:174400/500000 train_loss:2.1104 train_time:7624607ms step_avg:43.72ms
+step:174600/500000 train_loss:2.1805 train_time:7633333ms step_avg:43.72ms
+step:174800/500000 train_loss:2.0631 train_time:7642073ms step_avg:43.72ms
+step:175000/500000 train_loss:2.1022 train_time:7650819ms step_avg:43.72ms
+step:175200/500000 train_loss:1.8115 train_time:7659557ms step_avg:43.72ms
+step:175400/500000 train_loss:2.0692 train_time:7668291ms step_avg:43.72ms
+step:175600/500000 train_loss:2.1445 train_time:7677029ms step_avg:43.72ms
+step:175800/500000 train_loss:1.9551 train_time:7685758ms step_avg:43.72ms
+step:176000/500000 train_loss:1.9635 train_time:7694493ms step_avg:43.72ms
+step:176200/500000 train_loss:1.9418 train_time:7703234ms step_avg:43.72ms
+step:176400/500000 train_loss:2.1112 train_time:7711975ms step_avg:43.72ms
+step:176600/500000 train_loss:2.1513 train_time:7720710ms step_avg:43.72ms
+step:176800/500000 train_loss:2.0037 train_time:7729571ms step_avg:43.72ms
+step:177000/500000 train_loss:1.9129 train_time:7738307ms step_avg:43.72ms
+step:177200/500000 train_loss:2.1441 train_time:7747046ms step_avg:43.72ms
+step:177400/500000 train_loss:1.9549 train_time:7755780ms step_avg:43.72ms
+step:177600/500000 train_loss:1.9830 train_time:7764513ms step_avg:43.72ms
+step:177800/500000 train_loss:1.9886 train_time:7773254ms step_avg:43.72ms
+step:178000/500000 train_loss:2.1259 train_time:7781991ms step_avg:43.72ms
+step:178200/500000 train_loss:1.9354 train_time:7790727ms step_avg:43.72ms
+step:178400/500000 train_loss:2.1140 train_time:7799471ms step_avg:43.72ms
+step:178600/500000 train_loss:1.8590 train_time:7808207ms step_avg:43.72ms
+step:178800/500000 train_loss:2.0454 train_time:7816945ms step_avg:43.72ms
+step:179000/500000 train_loss:1.9407 train_time:7825676ms step_avg:43.72ms
+step:179200/500000 train_loss:1.9950 train_time:7834409ms step_avg:43.72ms
+step:179400/500000 train_loss:1.9914 train_time:7843141ms step_avg:43.72ms
+step:179600/500000 train_loss:2.0901 train_time:7851877ms step_avg:43.72ms
+step:179800/500000 train_loss:1.9138 train_time:7860608ms step_avg:43.72ms
+step:180000/500000 train_loss:2.0217 train_time:7869340ms step_avg:43.72ms
+step:180000/500000 val_loss:2.0117 val_bpb:1.1914 train_time:7869356ms step_avg:43.72ms
+step:180200/500000 train_loss:2.0164 train_time:7878077ms step_avg:43.72ms
+step:180400/500000 train_loss:2.0996 train_time:7886806ms step_avg:43.72ms
+step:180600/500000 train_loss:2.1237 train_time:7895545ms step_avg:43.72ms
+step:180800/500000 train_loss:2.0568 train_time:7904280ms step_avg:43.72ms
+step:181000/500000 train_loss:2.0839 train_time:7913142ms step_avg:43.72ms
+step:181200/500000 train_loss:1.9792 train_time:7921880ms step_avg:43.72ms
+step:181400/500000 train_loss:1.8169 train_time:7930628ms step_avg:43.72ms
+step:181600/500000 train_loss:2.0093 train_time:7939374ms step_avg:43.72ms
+step:181800/500000 train_loss:1.8723 train_time:7948111ms step_avg:43.72ms
+step:182000/500000 train_loss:2.0589 train_time:7956844ms step_avg:43.72ms
+step:182200/500000 train_loss:2.2871 train_time:7965596ms step_avg:43.72ms
+step:182400/500000 train_loss:1.9727 train_time:7974326ms step_avg:43.72ms
+step:182600/500000 train_loss:2.1237 train_time:7983068ms step_avg:43.72ms
+step:182800/500000 train_loss:1.9695 train_time:7991805ms step_avg:43.72ms
+step:183000/500000 train_loss:1.9939 train_time:8000541ms step_avg:43.72ms
+step:183200/500000 train_loss:2.1586 train_time:8009273ms step_avg:43.72ms
+step:183400/500000 train_loss:1.9795 train_time:8018006ms step_avg:43.72ms
+step:183600/500000 train_loss:2.0676 train_time:8026744ms step_avg:43.72ms
+step:183800/500000 train_loss:1.9571 train_time:8035482ms step_avg:43.72ms
+step:184000/500000 train_loss:2.0105 train_time:8044212ms step_avg:43.72ms
+step:184200/500000 train_loss:2.0779 train_time:8052949ms step_avg:43.72ms
+step:184400/500000 train_loss:2.0069 train_time:8061687ms step_avg:43.72ms
+step:184600/500000 train_loss:2.0897 train_time:8070420ms step_avg:43.72ms
+step:184800/500000 train_loss:1.9466 train_time:8079148ms step_avg:43.72ms
+step:185000/500000 train_loss:2.0431 train_time:8088000ms step_avg:43.72ms
+step:185200/500000 train_loss:2.0700 train_time:8096735ms step_avg:43.72ms
+step:185400/500000 train_loss:1.8829 train_time:8105470ms step_avg:43.72ms
+step:185600/500000 train_loss:2.0099 train_time:8114181ms step_avg:43.72ms
+step:185800/500000 train_loss:1.9734 train_time:8122913ms step_avg:43.72ms
+step:186000/500000 train_loss:2.0107 train_time:8131657ms step_avg:43.72ms
+step:186200/500000 train_loss:1.8863 train_time:8140408ms step_avg:43.72ms
+step:186400/500000 train_loss:1.8992 train_time:8149145ms step_avg:43.72ms
+step:186600/500000 train_loss:2.0194 train_time:8157881ms step_avg:43.72ms
+step:186800/500000 train_loss:2.0629 train_time:8166619ms step_avg:43.72ms
+step:187000/500000 train_loss:2.0338 train_time:8175361ms step_avg:43.72ms
+step:187200/500000 train_loss:2.0185 train_time:8184101ms step_avg:43.72ms
+step:187400/500000 train_loss:2.1287 train_time:8192842ms step_avg:43.72ms
+step:187600/500000 train_loss:1.9641 train_time:8201582ms step_avg:43.72ms
+step:187800/500000 train_loss:1.8271 train_time:8210327ms step_avg:43.72ms
+step:188000/500000 train_loss:1.8761 train_time:8219195ms step_avg:43.72ms
+step:188200/500000 train_loss:2.0407 train_time:8227939ms step_avg:43.72ms
+step:188400/500000 train_loss:1.9929 train_time:8236687ms step_avg:43.72ms
+step:188600/500000 train_loss:2.1855 train_time:8245431ms step_avg:43.72ms
+step:188800/500000 train_loss:1.9462 train_time:8254187ms step_avg:43.72ms
+step:189000/500000 train_loss:2.0223 train_time:8262930ms step_avg:43.72ms
+step:189200/500000 train_loss:2.0223 train_time:8271678ms step_avg:43.72ms
+step:189400/500000 train_loss:1.9818 train_time:8280415ms step_avg:43.72ms
+step:189600/500000 train_loss:1.9641 train_time:8289171ms step_avg:43.72ms
+step:189800/500000 train_loss:2.0795 train_time:8297919ms step_avg:43.72ms
+step:190000/500000 train_loss:1.9859 train_time:8306658ms step_avg:43.72ms
+step:190200/500000 train_loss:2.0013 train_time:8315407ms step_avg:43.72ms
+step:190400/500000 train_loss:1.8687 train_time:8324156ms step_avg:43.72ms
+step:190600/500000 train_loss:2.0704 train_time:8332901ms step_avg:43.72ms
+step:190800/500000 train_loss:2.0072 train_time:8341640ms step_avg:43.72ms
+step:191000/500000 train_loss:1.9459 train_time:8350385ms step_avg:43.72ms
+step:191200/500000 train_loss:2.0498 train_time:8359126ms step_avg:43.72ms
+step:191400/500000 train_loss:2.0077 train_time:8367862ms step_avg:43.72ms
+step:191600/500000 train_loss:2.0151 train_time:8376604ms step_avg:43.72ms
+step:191800/500000 train_loss:2.0355 train_time:8385354ms step_avg:43.72ms
+step:192000/500000 train_loss:1.9724 train_time:8394096ms step_avg:43.72ms
+step:192200/500000 train_loss:2.0984 train_time:8402951ms step_avg:43.72ms
+step:192400/500000 train_loss:1.8789 train_time:8411689ms step_avg:43.72ms
+step:192600/500000 train_loss:1.8561 train_time:8420432ms step_avg:43.72ms
+step:192800/500000 train_loss:2.0320 train_time:8429163ms step_avg:43.72ms
+step:193000/500000 train_loss:2.1582 train_time:8437899ms step_avg:43.72ms
+step:193200/500000 train_loss:1.8950 train_time:8446638ms step_avg:43.72ms
+step:193400/500000 train_loss:1.9427 train_time:8455369ms step_avg:43.72ms
+step:193600/500000 train_loss:2.0281 train_time:8464117ms step_avg:43.72ms
+step:193800/500000 train_loss:2.1803 train_time:8472865ms step_avg:43.72ms
+step:194000/500000 train_loss:2.0341 train_time:8481600ms step_avg:43.72ms
+step:194200/500000 train_loss:2.0818 train_time:8490324ms step_avg:43.72ms
+step:194400/500000 train_loss:2.0853 train_time:8499050ms step_avg:43.72ms
+step:194600/500000 train_loss:2.1322 train_time:8507769ms step_avg:43.72ms
+step:194800/500000 train_loss:1.9417 train_time:8516499ms step_avg:43.72ms
+step:195000/500000 train_loss:1.9583 train_time:8525228ms step_avg:43.72ms
+step:195200/500000 train_loss:2.1288 train_time:8533957ms step_avg:43.72ms
+step:195400/500000 train_loss:1.8735 train_time:8542684ms step_avg:43.72ms
+step:195600/500000 train_loss:1.9380 train_time:8551417ms step_avg:43.72ms
+step:195800/500000 train_loss:2.0153 train_time:8560153ms step_avg:43.72ms
+step:196000/500000 train_loss:2.0793 train_time:8568904ms step_avg:43.72ms
+step:196200/500000 train_loss:1.9494 train_time:8577766ms step_avg:43.72ms
+step:196400/500000 train_loss:2.0027 train_time:8586500ms step_avg:43.72ms
+step:196600/500000 train_loss:1.8933 train_time:8595250ms step_avg:43.72ms
+step:196800/500000 train_loss:2.1100 train_time:8603995ms step_avg:43.72ms
+step:197000/500000 train_loss:1.9367 train_time:8612745ms step_avg:43.72ms
+step:197200/500000 train_loss:2.0373 train_time:8621490ms step_avg:43.72ms
+step:197400/500000 train_loss:1.9918 train_time:8630234ms step_avg:43.72ms
+step:197600/500000 train_loss:1.9524 train_time:8638976ms step_avg:43.72ms
+step:197800/500000 train_loss:2.0744 train_time:8647712ms step_avg:43.72ms
+step:198000/500000 train_loss:2.0879 train_time:8656445ms step_avg:43.72ms
+step:198200/500000 train_loss:1.8853 train_time:8665173ms step_avg:43.72ms
+step:198400/500000 train_loss:1.9999 train_time:8673907ms step_avg:43.72ms
+step:198600/500000 train_loss:2.0703 train_time:8682644ms step_avg:43.72ms
+step:198800/500000 train_loss:1.9407 train_time:8691376ms step_avg:43.72ms
+step:199000/500000 train_loss:1.9825 train_time:8700102ms step_avg:43.72ms
+step:199200/500000 train_loss:2.2138 train_time:8708839ms step_avg:43.72ms
+step:199400/500000 train_loss:2.0556 train_time:8717569ms step_avg:43.72ms
+step:199600/500000 train_loss:1.8311 train_time:8726291ms step_avg:43.72ms
+step:199800/500000 train_loss:1.8928 train_time:8735019ms step_avg:43.72ms
+step:200000/500000 train_loss:1.9818 train_time:8743744ms step_avg:43.72ms
+step:200000/500000 val_loss:2.0105 val_bpb:1.1907 train_time:8743760ms step_avg:43.72ms
+step:200200/500000 train_loss:1.9868 train_time:8752474ms step_avg:43.72ms
+step:200400/500000 train_loss:2.1167 train_time:8761320ms step_avg:43.72ms
+step:200600/500000 train_loss:1.9449 train_time:8770043ms step_avg:43.72ms
+step:200800/500000 train_loss:2.0322 train_time:8778769ms step_avg:43.72ms
+step:201000/500000 train_loss:2.0473 train_time:8787500ms step_avg:43.72ms
+step:201200/500000 train_loss:1.9667 train_time:8796227ms step_avg:43.72ms
+step:201400/500000 train_loss:2.1328 train_time:8804954ms step_avg:43.72ms
+step:201600/500000 train_loss:2.0573 train_time:8813695ms step_avg:43.72ms
+step:201800/500000 train_loss:1.9931 train_time:8822429ms step_avg:43.72ms
+step:202000/500000 train_loss:2.1286 train_time:8831155ms step_avg:43.72ms
+step:202200/500000 train_loss:2.0619 train_time:8839888ms step_avg:43.72ms
+step:202400/500000 train_loss:2.1035 train_time:8848616ms step_avg:43.72ms
+step:202600/500000 train_loss:2.0375 train_time:8857346ms step_avg:43.72ms
+step:202800/500000 train_loss:1.9938 train_time:8866075ms step_avg:43.72ms
+step:203000/500000 train_loss:1.8688 train_time:8874808ms step_avg:43.72ms
+step:203200/500000 train_loss:1.9930 train_time:8883537ms step_avg:43.72ms
+step:203400/500000 train_loss:1.9899 train_time:8892264ms step_avg:43.72ms
+step:203600/500000 train_loss:1.7814 train_time:8901000ms step_avg:43.72ms
+step:203800/500000 train_loss:1.9964 train_time:8909742ms step_avg:43.72ms
+step:204000/500000 train_loss:2.0461 train_time:8918478ms step_avg:43.72ms
+step:204200/500000 train_loss:1.9670 train_time:8927209ms step_avg:43.72ms
+step:204400/500000 train_loss:2.0828 train_time:8936045ms step_avg:43.72ms
+step:204600/500000 train_loss:2.0625 train_time:8944774ms step_avg:43.72ms
+step:204800/500000 train_loss:1.9721 train_time:8953505ms step_avg:43.72ms
+step:205000/500000 train_loss:1.8662 train_time:8962232ms step_avg:43.72ms
+step:205200/500000 train_loss:1.9807 train_time:8970959ms step_avg:43.72ms
+step:205400/500000 train_loss:2.0544 train_time:8979687ms step_avg:43.72ms
+step:205600/500000 train_loss:1.9625 train_time:8988423ms step_avg:43.72ms
+step:205800/500000 train_loss:2.0482 train_time:8997154ms step_avg:43.72ms
+step:206000/500000 train_loss:1.9984 train_time:9005880ms step_avg:43.72ms
+step:206200/500000 train_loss:1.9716 train_time:9014613ms step_avg:43.72ms
+step:206400/500000 train_loss:2.0728 train_time:9023343ms step_avg:43.72ms
+step:206600/500000 train_loss:1.9063 train_time:9032073ms step_avg:43.72ms
+step:206800/500000 train_loss:1.9478 train_time:9040806ms step_avg:43.72ms
+step:207000/500000 train_loss:2.0179 train_time:9049527ms step_avg:43.72ms
+step:207200/500000 train_loss:2.0109 train_time:9058254ms step_avg:43.72ms
+step:207400/500000 train_loss:2.0661 train_time:9066986ms step_avg:43.72ms
+step:207600/500000 train_loss:1.8557 train_time:9075723ms step_avg:43.72ms
+step:207800/500000 train_loss:1.8529 train_time:9084456ms step_avg:43.72ms
+step:208000/500000 train_loss:2.0855 train_time:9093188ms step_avg:43.72ms
+step:208200/500000 train_loss:2.0335 train_time:9101921ms step_avg:43.72ms
+step:208400/500000 train_loss:1.8988 train_time:9110652ms step_avg:43.72ms
+step:208600/500000 train_loss:1.9468 train_time:9119502ms step_avg:43.72ms
+step:208800/500000 train_loss:1.9999 train_time:9128232ms step_avg:43.72ms
+step:209000/500000 train_loss:2.0396 train_time:9136961ms step_avg:43.72ms
+step:209200/500000 train_loss:2.0371 train_time:9145693ms step_avg:43.72ms
+step:209400/500000 train_loss:2.1630 train_time:9154422ms step_avg:43.72ms
+step:209600/500000 train_loss:1.9490 train_time:9163147ms step_avg:43.72ms
+step:209800/500000 train_loss:1.9805 train_time:9171877ms step_avg:43.72ms
+step:210000/500000 train_loss:2.0601 train_time:9180605ms step_avg:43.72ms
+step:210200/500000 train_loss:2.0570 train_time:9189333ms step_avg:43.72ms
+step:210400/500000 train_loss:2.0318 train_time:9198063ms step_avg:43.72ms
+step:210600/500000 train_loss:2.1689 train_time:9206795ms step_avg:43.72ms
+step:210800/500000 train_loss:1.9921 train_time:9215524ms step_avg:43.72ms
+step:211000/500000 train_loss:2.0106 train_time:9224247ms step_avg:43.72ms
+step:211200/500000 train_loss:2.0720 train_time:9232976ms step_avg:43.72ms
+step:211400/500000 train_loss:2.1134 train_time:9241710ms step_avg:43.72ms
+step:211600/500000 train_loss:2.1559 train_time:9250441ms step_avg:43.72ms
+step:211800/500000 train_loss:2.0516 train_time:9259174ms step_avg:43.72ms
+step:212000/500000 train_loss:2.0935 train_time:9267899ms step_avg:43.72ms
+step:212200/500000 train_loss:2.0999 train_time:9276632ms step_avg:43.72ms
+step:212400/500000 train_loss:2.0021 train_time:9285357ms step_avg:43.72ms
+step:212600/500000 train_loss:2.0492 train_time:9294090ms step_avg:43.72ms
+step:212800/500000 train_loss:2.0105 train_time:9302943ms step_avg:43.72ms
+step:213000/500000 train_loss:1.9174 train_time:9311671ms step_avg:43.72ms
+step:213200/500000 train_loss:2.1274 train_time:9320401ms step_avg:43.72ms
+step:213400/500000 train_loss:1.9693 train_time:9329132ms step_avg:43.72ms
+step:213600/500000 train_loss:1.8956 train_time:9337862ms step_avg:43.72ms
+step:213800/500000 train_loss:2.0412 train_time:9346591ms step_avg:43.72ms
+step:214000/500000 train_loss:2.0581 train_time:9355323ms step_avg:43.72ms
+step:214200/500000 train_loss:2.0161 train_time:9364052ms step_avg:43.72ms
+step:214400/500000 train_loss:2.0053 train_time:9372785ms step_avg:43.72ms
+step:214600/500000 train_loss:2.0402 train_time:9381520ms step_avg:43.72ms
+step:214800/500000 train_loss:1.9926 train_time:9390259ms step_avg:43.72ms
+step:215000/500000 train_loss:2.0061 train_time:9398986ms step_avg:43.72ms
+step:215200/500000 train_loss:1.9774 train_time:9407719ms step_avg:43.72ms
+step:215400/500000 train_loss:2.0146 train_time:9416449ms step_avg:43.72ms
+step:215600/500000 train_loss:2.0522 train_time:9425178ms step_avg:43.72ms
+step:215800/500000 train_loss:1.9747 train_time:9433914ms step_avg:43.72ms
+step:216000/500000 train_loss:2.0128 train_time:9442642ms step_avg:43.72ms
+step:216200/500000 train_loss:2.0113 train_time:9451366ms step_avg:43.72ms
+step:216400/500000 train_loss:1.8583 train_time:9460100ms step_avg:43.72ms
+step:216600/500000 train_loss:2.0709 train_time:9468838ms step_avg:43.72ms
+step:216800/500000 train_loss:2.0007 train_time:9477686ms step_avg:43.72ms
+step:217000/500000 train_loss:2.0354 train_time:9486424ms step_avg:43.72ms
+step:217200/500000 train_loss:2.0861 train_time:9495155ms step_avg:43.72ms
+step:217400/500000 train_loss:1.9562 train_time:9503889ms step_avg:43.72ms
+step:217600/500000 train_loss:2.0061 train_time:9512625ms step_avg:43.72ms
+step:217800/500000 train_loss:1.9055 train_time:9521360ms step_avg:43.72ms
+step:218000/500000 train_loss:2.0533 train_time:9530097ms step_avg:43.72ms
+step:218200/500000 train_loss:2.0140 train_time:9538834ms step_avg:43.72ms
+step:218400/500000 train_loss:2.0301 train_time:9547568ms step_avg:43.72ms
+step:218600/500000 train_loss:2.1542 train_time:9556315ms step_avg:43.72ms
+step:218800/500000 train_loss:2.0486 train_time:9565045ms step_avg:43.72ms
+step:219000/500000 train_loss:1.9843 train_time:9573785ms step_avg:43.72ms
+step:219200/500000 train_loss:1.9709 train_time:9582517ms step_avg:43.72ms
+step:219400/500000 train_loss:2.0735 train_time:9591258ms step_avg:43.72ms
+step:219600/500000 train_loss:1.9672 train_time:9600003ms step_avg:43.72ms
+step:219800/500000 train_loss:1.9823 train_time:9608744ms step_avg:43.72ms
+step:220000/500000 train_loss:1.9377 train_time:9617487ms step_avg:43.72ms
+step:220000/500000 val_loss:2.0076 val_bpb:1.1890 train_time:9617503ms step_avg:43.72ms
+step:220200/500000 train_loss:2.0217 train_time:9626225ms step_avg:43.72ms
+step:220400/500000 train_loss:2.0827 train_time:9634959ms step_avg:43.72ms
+step:220600/500000 train_loss:2.0294 train_time:9643691ms step_avg:43.72ms
+step:220800/500000 train_loss:2.1254 train_time:9652438ms step_avg:43.72ms
+step:221000/500000 train_loss:2.1577 train_time:9661294ms step_avg:43.72ms
+step:221200/500000 train_loss:2.0463 train_time:9670034ms step_avg:43.72ms
+step:221400/500000 train_loss:2.1585 train_time:9678765ms step_avg:43.72ms
+step:221600/500000 train_loss:1.9449 train_time:9687499ms step_avg:43.72ms
+step:221800/500000 train_loss:2.1088 train_time:9696229ms step_avg:43.72ms
+step:222000/500000 train_loss:1.9759 train_time:9704959ms step_avg:43.72ms
+step:222200/500000 train_loss:1.9192 train_time:9713682ms step_avg:43.72ms
+step:222400/500000 train_loss:1.9609 train_time:9722414ms step_avg:43.72ms
+step:222600/500000 train_loss:1.9626 train_time:9731139ms step_avg:43.72ms
+step:222800/500000 train_loss:1.9320 train_time:9739836ms step_avg:43.72ms
+step:223000/500000 train_loss:2.0679 train_time:9748562ms step_avg:43.72ms
+step:223200/500000 train_loss:1.9997 train_time:9757297ms step_avg:43.72ms
+step:223400/500000 train_loss:1.8773 train_time:9766028ms step_avg:43.72ms
+step:223600/500000 train_loss:2.0217 train_time:9774760ms step_avg:43.72ms
+step:223800/500000 train_loss:2.1673 train_time:9783485ms step_avg:43.72ms
+step:224000/500000 train_loss:1.9545 train_time:9792341ms step_avg:43.72ms
+step:224200/500000 train_loss:1.9287 train_time:9801075ms step_avg:43.72ms
+step:224400/500000 train_loss:1.9226 train_time:9809802ms step_avg:43.72ms
+step:224600/500000 train_loss:2.0093 train_time:9818539ms step_avg:43.72ms
+step:224800/500000 train_loss:2.0917 train_time:9827270ms step_avg:43.72ms
+step:225000/500000 train_loss:1.8178 train_time:9836011ms step_avg:43.72ms
+step:225200/500000 train_loss:1.9130 train_time:9844760ms step_avg:43.72ms
+step:225400/500000 train_loss:2.0751 train_time:9853506ms step_avg:43.72ms
+step:225600/500000 train_loss:1.9437 train_time:9862235ms step_avg:43.72ms
+step:225800/500000 train_loss:2.0723 train_time:9870970ms step_avg:43.72ms
+step:226000/500000 train_loss:1.9647 train_time:9879707ms step_avg:43.72ms
+step:226200/500000 train_loss:2.0678 train_time:9888446ms step_avg:43.72ms
+step:226400/500000 train_loss:2.0256 train_time:9897181ms step_avg:43.72ms
+step:226600/500000 train_loss:2.0307 train_time:9905921ms step_avg:43.72ms
+step:226800/500000 train_loss:2.1227 train_time:9914660ms step_avg:43.72ms
+step:227000/500000 train_loss:2.0353 train_time:9923391ms step_avg:43.72ms
+step:227200/500000 train_loss:1.9588 train_time:9932120ms step_avg:43.72ms
+step:227400/500000 train_loss:2.0887 train_time:9940856ms step_avg:43.72ms
+step:227600/500000 train_loss:1.9178 train_time:9949585ms step_avg:43.72ms
+step:227800/500000 train_loss:2.0666 train_time:9958313ms step_avg:43.72ms
+step:228000/500000 train_loss:1.9225 train_time:9967159ms step_avg:43.72ms
+step:228200/500000 train_loss:1.9397 train_time:9975883ms step_avg:43.72ms
+step:228400/500000 train_loss:1.9910 train_time:9984618ms step_avg:43.72ms
+step:228600/500000 train_loss:1.9247 train_time:9993362ms step_avg:43.72ms
+step:228800/500000 train_loss:1.8871 train_time:10002097ms step_avg:43.72ms
+step:229000/500000 train_loss:2.0247 train_time:10010830ms step_avg:43.72ms
+step:229200/500000 train_loss:1.9405 train_time:10019566ms step_avg:43.72ms
+step:229400/500000 train_loss:2.0672 train_time:10028294ms step_avg:43.72ms
+step:229600/500000 train_loss:2.0012 train_time:10037022ms step_avg:43.72ms
+step:229800/500000 train_loss:2.0390 train_time:10045754ms step_avg:43.72ms
+step:230000/500000 train_loss:1.9441 train_time:10054490ms step_avg:43.72ms
+step:230200/500000 train_loss:1.9243 train_time:10063234ms step_avg:43.72ms
+step:230400/500000 train_loss:2.0568 train_time:10071966ms step_avg:43.72ms
+step:230600/500000 train_loss:1.7721 train_time:10080697ms step_avg:43.72ms
+step:230800/500000 train_loss:1.9805 train_time:10089428ms step_avg:43.72ms
+step:231000/500000 train_loss:1.8422 train_time:10098161ms step_avg:43.71ms
+step:231200/500000 train_loss:2.0484 train_time:10106900ms step_avg:43.71ms
+step:231400/500000 train_loss:2.0169 train_time:10115633ms step_avg:43.71ms
+step:231600/500000 train_loss:2.0759 train_time:10124365ms step_avg:43.71ms
+step:231800/500000 train_loss:1.9531 train_time:10133092ms step_avg:43.71ms
+step:232000/500000 train_loss:2.0485 train_time:10141817ms step_avg:43.71ms
+step:232200/500000 train_loss:1.9279 train_time:10150679ms step_avg:43.72ms
+step:232400/500000 train_loss:2.1539 train_time:10159418ms step_avg:43.72ms
+step:232600/500000 train_loss:2.0334 train_time:10168164ms step_avg:43.72ms
+step:232800/500000 train_loss:1.7609 train_time:10176898ms step_avg:43.72ms
+step:233000/500000 train_loss:2.0500 train_time:10185626ms step_avg:43.72ms
+step:233200/500000 train_loss:1.9637 train_time:10194360ms step_avg:43.72ms
+step:233400/500000 train_loss:2.0705 train_time:10203093ms step_avg:43.72ms
+step:233600/500000 train_loss:2.0482 train_time:10211826ms step_avg:43.72ms
+step:233800/500000 train_loss:2.1037 train_time:10220562ms step_avg:43.71ms
+step:234000/500000 train_loss:1.9615 train_time:10229297ms step_avg:43.71ms
+step:234200/500000 train_loss:1.9095 train_time:10238033ms step_avg:43.71ms
+step:234400/500000 train_loss:1.9257 train_time:10246759ms step_avg:43.71ms
+step:234600/500000 train_loss:2.2710 train_time:10255487ms step_avg:43.71ms
+step:234800/500000 train_loss:1.9472 train_time:10264212ms step_avg:43.71ms
+step:235000/500000 train_loss:2.0240 train_time:10272947ms step_avg:43.71ms
+step:235200/500000 train_loss:2.0640 train_time:10281676ms step_avg:43.71ms
+step:235400/500000 train_loss:1.8869 train_time:10290411ms step_avg:43.71ms
+step:235600/500000 train_loss:1.9445 train_time:10299144ms step_avg:43.71ms
+step:235800/500000 train_loss:2.0251 train_time:10307873ms step_avg:43.71ms
+step:236000/500000 train_loss:1.9722 train_time:10316597ms step_avg:43.71ms
+step:236200/500000 train_loss:2.0008 train_time:10325324ms step_avg:43.71ms
+step:236400/500000 train_loss:2.1860 train_time:10334172ms step_avg:43.71ms
+step:236600/500000 train_loss:1.9035 train_time:10342897ms step_avg:43.71ms
+step:236800/500000 train_loss:1.9173 train_time:10351628ms step_avg:43.71ms
+step:237000/500000 train_loss:2.0712 train_time:10360356ms step_avg:43.71ms
+step:237200/500000 train_loss:1.9455 train_time:10369078ms step_avg:43.71ms
+step:237400/500000 train_loss:2.0799 train_time:10377812ms step_avg:43.71ms
+step:237600/500000 train_loss:1.9462 train_time:10386543ms step_avg:43.71ms
+step:237800/500000 train_loss:2.0383 train_time:10395286ms step_avg:43.71ms
+step:238000/500000 train_loss:1.9322 train_time:10404027ms step_avg:43.71ms
+step:238200/500000 train_loss:1.9640 train_time:10412765ms step_avg:43.71ms
+step:238400/500000 train_loss:2.0001 train_time:10421496ms step_avg:43.71ms
+step:238600/500000 train_loss:2.1391 train_time:10430231ms step_avg:43.71ms
+step:238800/500000 train_loss:1.9923 train_time:10438966ms step_avg:43.71ms
+step:239000/500000 train_loss:2.0646 train_time:10447711ms step_avg:43.71ms
+step:239200/500000 train_loss:1.8668 train_time:10456446ms step_avg:43.71ms
+step:239400/500000 train_loss:2.4861 train_time:10465183ms step_avg:43.71ms
+step:239600/500000 train_loss:1.8213 train_time:10473921ms step_avg:43.71ms
+step:239800/500000 train_loss:2.0816 train_time:10482658ms step_avg:43.71ms
+step:240000/500000 train_loss:1.8781 train_time:10491396ms step_avg:43.71ms
+step:240000/500000 val_loss:2.0063 val_bpb:1.1883 train_time:10491412ms step_avg:43.71ms
+step:240200/500000 train_loss:1.9510 train_time:10500136ms step_avg:43.71ms
+step:240400/500000 train_loss:1.9606 train_time:10508991ms step_avg:43.71ms
+step:240600/500000 train_loss:1.9815 train_time:10517721ms step_avg:43.71ms
+step:240800/500000 train_loss:1.9837 train_time:10526454ms step_avg:43.71ms
+step:241000/500000 train_loss:2.0349 train_time:10535193ms step_avg:43.71ms
+step:241200/500000 train_loss:1.8801 train_time:10543928ms step_avg:43.71ms
+step:241400/500000 train_loss:1.9354 train_time:10552661ms step_avg:43.71ms
+step:241600/500000 train_loss:1.8986 train_time:10561395ms step_avg:43.71ms
+step:241800/500000 train_loss:2.1174 train_time:10570128ms step_avg:43.71ms
+step:242000/500000 train_loss:1.9755 train_time:10578858ms step_avg:43.71ms
+step:242200/500000 train_loss:1.9471 train_time:10587594ms step_avg:43.71ms
+step:242400/500000 train_loss:2.0002 train_time:10596330ms step_avg:43.71ms
+step:242600/500000 train_loss:1.9670 train_time:10605064ms step_avg:43.71ms
+step:242800/500000 train_loss:1.8907 train_time:10613816ms step_avg:43.71ms
+step:243000/500000 train_loss:2.0368 train_time:10622560ms step_avg:43.71ms
+step:243200/500000 train_loss:2.0311 train_time:10631301ms step_avg:43.71ms
+step:243400/500000 train_loss:1.8762 train_time:10640036ms step_avg:43.71ms
+step:243600/500000 train_loss:1.9528 train_time:10648777ms step_avg:43.71ms
+step:243800/500000 train_loss:2.0887 train_time:10657512ms step_avg:43.71ms
+step:244000/500000 train_loss:2.0231 train_time:10666246ms step_avg:43.71ms
+step:244200/500000 train_loss:2.0592 train_time:10674982ms step_avg:43.71ms
+step:244400/500000 train_loss:2.0455 train_time:10683715ms step_avg:43.71ms
+step:244600/500000 train_loss:1.9829 train_time:10692567ms step_avg:43.71ms
+step:244800/500000 train_loss:1.8873 train_time:10701300ms step_avg:43.71ms
+step:245000/500000 train_loss:1.9246 train_time:10710035ms step_avg:43.71ms
+step:245200/500000 train_loss:1.9532 train_time:10718767ms step_avg:43.71ms
+step:245400/500000 train_loss:2.0863 train_time:10727502ms step_avg:43.71ms
+step:245600/500000 train_loss:2.0128 train_time:10736234ms step_avg:43.71ms
+step:245800/500000 train_loss:2.0420 train_time:10744968ms step_avg:43.71ms
+step:246000/500000 train_loss:1.9984 train_time:10753707ms step_avg:43.71ms
+step:246200/500000 train_loss:1.9309 train_time:10762446ms step_avg:43.71ms
+step:246400/500000 train_loss:2.0014 train_time:10771184ms step_avg:43.71ms
+step:246600/500000 train_loss:1.9898 train_time:10779923ms step_avg:43.71ms
+step:246800/500000 train_loss:1.9892 train_time:10788658ms step_avg:43.71ms
+step:247000/500000 train_loss:2.0507 train_time:10797397ms step_avg:43.71ms
+step:247200/500000 train_loss:1.8842 train_time:10806137ms step_avg:43.71ms
+step:247400/500000 train_loss:1.9936 train_time:10814874ms step_avg:43.71ms
+step:247600/500000 train_loss:2.0653 train_time:10823607ms step_avg:43.71ms
+step:247800/500000 train_loss:2.0729 train_time:10832353ms step_avg:43.71ms
+step:248000/500000 train_loss:1.9083 train_time:10841094ms step_avg:43.71ms
+step:248200/500000 train_loss:1.9371 train_time:10849830ms step_avg:43.71ms
+step:248400/500000 train_loss:1.9112 train_time:10858566ms step_avg:43.71ms
+step:248600/500000 train_loss:2.0996 train_time:10867428ms step_avg:43.71ms
+step:248800/500000 train_loss:1.9620 train_time:10876162ms step_avg:43.71ms
+step:249000/500000 train_loss:2.0224 train_time:10884895ms step_avg:43.71ms
+step:249200/500000 train_loss:1.9578 train_time:10893631ms step_avg:43.71ms
+step:249400/500000 train_loss:1.8937 train_time:10902369ms step_avg:43.71ms
+step:249600/500000 train_loss:1.8381 train_time:10911107ms step_avg:43.71ms
+step:249800/500000 train_loss:2.0133 train_time:10919845ms step_avg:43.71ms
+step:250000/500000 train_loss:2.0581 train_time:10928578ms step_avg:43.71ms
+step:250200/500000 train_loss:1.9867 train_time:10937312ms step_avg:43.71ms
+step:250400/500000 train_loss:1.9796 train_time:10946038ms step_avg:43.71ms
+step:250600/500000 train_loss:2.0323 train_time:10954770ms step_avg:43.71ms
+step:250800/500000 train_loss:1.8523 train_time:10963508ms step_avg:43.71ms
+step:251000/500000 train_loss:1.9193 train_time:10972245ms step_avg:43.71ms
+step:251200/500000 train_loss:1.9896 train_time:10980982ms step_avg:43.71ms
+step:251400/500000 train_loss:1.9091 train_time:10989715ms step_avg:43.71ms
+step:251600/500000 train_loss:1.9960 train_time:10998454ms step_avg:43.71ms
+step:251800/500000 train_loss:2.0287 train_time:11007188ms step_avg:43.71ms
+step:252000/500000 train_loss:1.9612 train_time:11015923ms step_avg:43.71ms
+step:252200/500000 train_loss:2.0508 train_time:11024659ms step_avg:43.71ms
+step:252400/500000 train_loss:1.9624 train_time:11033389ms step_avg:43.71ms
+step:252600/500000 train_loss:1.9766 train_time:11042125ms step_avg:43.71ms
+step:252800/500000 train_loss:2.0657 train_time:11050997ms step_avg:43.71ms
+step:253000/500000 train_loss:1.9535 train_time:11059732ms step_avg:43.71ms
+step:253200/500000 train_loss:2.0059 train_time:11068474ms step_avg:43.71ms
+step:253400/500000 train_loss:1.9705 train_time:11077211ms step_avg:43.71ms
+step:253600/500000 train_loss:2.0051 train_time:11085946ms step_avg:43.71ms
+step:253800/500000 train_loss:1.8818 train_time:11094681ms step_avg:43.71ms
+step:254000/500000 train_loss:2.0705 train_time:11103415ms step_avg:43.71ms
+step:254200/500000 train_loss:1.9941 train_time:11112156ms step_avg:43.71ms
+step:254400/500000 train_loss:1.9379 train_time:11120890ms step_avg:43.71ms
+step:254600/500000 train_loss:1.9741 train_time:11129625ms step_avg:43.71ms
+step:254800/500000 train_loss:1.8901 train_time:11138356ms step_avg:43.71ms
+step:255000/500000 train_loss:1.9534 train_time:11147089ms step_avg:43.71ms
+step:255200/500000 train_loss:2.0204 train_time:11155825ms step_avg:43.71ms
+step:255400/500000 train_loss:1.8525 train_time:11164563ms step_avg:43.71ms
+step:255600/500000 train_loss:2.0028 train_time:11173305ms step_avg:43.71ms
+step:255800/500000 train_loss:1.9450 train_time:11182037ms step_avg:43.71ms
+step:256000/500000 train_loss:2.0375 train_time:11190773ms step_avg:43.71ms
+step:256200/500000 train_loss:1.9353 train_time:11199508ms step_avg:43.71ms
+step:256400/500000 train_loss:2.1361 train_time:11208247ms step_avg:43.71ms
+step:256600/500000 train_loss:2.0129 train_time:11216977ms step_avg:43.71ms
+step:256800/500000 train_loss:2.0725 train_time:11225838ms step_avg:43.71ms
+step:257000/500000 train_loss:2.1332 train_time:11234569ms step_avg:43.71ms
+step:257200/500000 train_loss:1.7644 train_time:11243314ms step_avg:43.71ms
+step:257400/500000 train_loss:2.0439 train_time:11252048ms step_avg:43.71ms
+step:257600/500000 train_loss:1.9964 train_time:11260788ms step_avg:43.71ms
+step:257800/500000 train_loss:2.0681 train_time:11269525ms step_avg:43.71ms
+step:258000/500000 train_loss:1.9432 train_time:11278255ms step_avg:43.71ms
+step:258200/500000 train_loss:1.9413 train_time:11286980ms step_avg:43.71ms
+step:258400/500000 train_loss:2.1118 train_time:11295708ms step_avg:43.71ms
+step:258600/500000 train_loss:1.9258 train_time:11304449ms step_avg:43.71ms
+step:258800/500000 train_loss:2.0245 train_time:11313180ms step_avg:43.71ms
+step:259000/500000 train_loss:2.0141 train_time:11321910ms step_avg:43.71ms
+step:259200/500000 train_loss:1.9900 train_time:11330641ms step_avg:43.71ms
+step:259400/500000 train_loss:2.4748 train_time:11339371ms step_avg:43.71ms
+step:259600/500000 train_loss:2.0246 train_time:11348100ms step_avg:43.71ms
+step:259800/500000 train_loss:1.9782 train_time:11356835ms step_avg:43.71ms
+step:260000/500000 train_loss:1.9953 train_time:11365662ms step_avg:43.71ms
+step:260000/500000 val_loss:2.0041 val_bpb:1.1870 train_time:11365677ms step_avg:43.71ms
+step:260200/500000 train_loss:1.9975 train_time:11374405ms step_avg:43.71ms
+step:260400/500000 train_loss:2.0585 train_time:11383133ms step_avg:43.71ms
+step:260600/500000 train_loss:2.0042 train_time:11391866ms step_avg:43.71ms
+step:260800/500000 train_loss:1.9907 train_time:11400605ms step_avg:43.71ms
+step:261000/500000 train_loss:1.8036 train_time:11409345ms step_avg:43.71ms
+step:261200/500000 train_loss:1.9688 train_time:11418070ms step_avg:43.71ms
+step:261400/500000 train_loss:2.0001 train_time:11426801ms step_avg:43.71ms
+step:261600/500000 train_loss:1.7854 train_time:11435534ms step_avg:43.71ms
+step:261800/500000 train_loss:1.9908 train_time:11444271ms step_avg:43.71ms
+step:262000/500000 train_loss:2.0764 train_time:11453013ms step_avg:43.71ms
+step:262200/500000 train_loss:1.8727 train_time:11461753ms step_avg:43.71ms
+step:262400/500000 train_loss:2.6147 train_time:11470498ms step_avg:43.71ms
+step:262600/500000 train_loss:2.0029 train_time:11479235ms step_avg:43.71ms
+step:262800/500000 train_loss:2.2071 train_time:11487964ms step_avg:43.71ms
+step:263000/500000 train_loss:2.0746 train_time:11496701ms step_avg:43.71ms
+step:263200/500000 train_loss:2.1073 train_time:11505436ms step_avg:43.71ms
+step:263400/500000 train_loss:1.9842 train_time:11514169ms step_avg:43.71ms
+step:263600/500000 train_loss:2.0114 train_time:11522905ms step_avg:43.71ms
+step:263800/500000 train_loss:2.0245 train_time:11531645ms step_avg:43.71ms
+step:264000/500000 train_loss:2.0311 train_time:11540511ms step_avg:43.71ms
+step:264200/500000 train_loss:1.9754 train_time:11549243ms step_avg:43.71ms
+step:264400/500000 train_loss:2.0189 train_time:11557976ms step_avg:43.71ms
+step:264600/500000 train_loss:2.0092 train_time:11566712ms step_avg:43.71ms
+step:264800/500000 train_loss:1.9114 train_time:11575445ms step_avg:43.71ms
+step:265000/500000 train_loss:1.8924 train_time:11584189ms step_avg:43.71ms
+step:265200/500000 train_loss:2.0347 train_time:11592928ms step_avg:43.71ms
+step:265400/500000 train_loss:2.0220 train_time:11601670ms step_avg:43.71ms
+step:265600/500000 train_loss:2.0149 train_time:11610396ms step_avg:43.71ms
+step:265800/500000 train_loss:2.0498 train_time:11619137ms step_avg:43.71ms
+step:266000/500000 train_loss:2.0556 train_time:11627871ms step_avg:43.71ms
+step:266200/500000 train_loss:1.9562 train_time:11636604ms step_avg:43.71ms
+step:266400/500000 train_loss:2.0086 train_time:11645338ms step_avg:43.71ms
+step:266600/500000 train_loss:2.0249 train_time:11654089ms step_avg:43.71ms
+step:266800/500000 train_loss:1.9725 train_time:11662827ms step_avg:43.71ms
+step:267000/500000 train_loss:2.0500 train_time:11671566ms step_avg:43.71ms
+step:267200/500000 train_loss:2.0668 train_time:11680299ms step_avg:43.71ms
+step:267400/500000 train_loss:1.9282 train_time:11689039ms step_avg:43.71ms
+step:267600/500000 train_loss:1.9192 train_time:11697772ms step_avg:43.71ms
+step:267800/500000 train_loss:1.9734 train_time:11706515ms step_avg:43.71ms
+step:268000/500000 train_loss:2.0209 train_time:11715254ms step_avg:43.71ms
+step:268200/500000 train_loss:1.9353 train_time:11724109ms step_avg:43.71ms
+step:268400/500000 train_loss:2.0996 train_time:11732847ms step_avg:43.71ms
+step:268600/500000 train_loss:1.9886 train_time:11741584ms step_avg:43.71ms
+step:268800/500000 train_loss:2.0374 train_time:11750317ms step_avg:43.71ms
+step:269000/500000 train_loss:2.0219 train_time:11759057ms step_avg:43.71ms
+step:269200/500000 train_loss:1.8802 train_time:11767793ms step_avg:43.71ms
+step:269400/500000 train_loss:2.0656 train_time:11776526ms step_avg:43.71ms
+step:269600/500000 train_loss:1.9443 train_time:11785264ms step_avg:43.71ms
+step:269800/500000 train_loss:1.9677 train_time:11793999ms step_avg:43.71ms
+step:270000/500000 train_loss:2.0042 train_time:11802735ms step_avg:43.71ms
+step:270200/500000 train_loss:1.9749 train_time:11811467ms step_avg:43.71ms
+step:270400/500000 train_loss:2.0969 train_time:11820205ms step_avg:43.71ms
+step:270600/500000 train_loss:2.1183 train_time:11828939ms step_avg:43.71ms
+step:270800/500000 train_loss:1.8985 train_time:11837662ms step_avg:43.71ms
+step:271000/500000 train_loss:2.0729 train_time:11846398ms step_avg:43.71ms
+step:271200/500000 train_loss:2.0153 train_time:11855135ms step_avg:43.71ms
+step:271400/500000 train_loss:2.0576 train_time:11863864ms step_avg:43.71ms
+step:271600/500000 train_loss:1.9227 train_time:11872593ms step_avg:43.71ms
+step:271800/500000 train_loss:1.9194 train_time:11881327ms step_avg:43.71ms
+step:272000/500000 train_loss:2.0508 train_time:11890061ms step_avg:43.71ms
+step:272200/500000 train_loss:1.9958 train_time:11898917ms step_avg:43.71ms
+step:272400/500000 train_loss:2.0412 train_time:11907652ms step_avg:43.71ms
+step:272600/500000 train_loss:2.1157 train_time:11916387ms step_avg:43.71ms
+step:272800/500000 train_loss:1.9742 train_time:11925122ms step_avg:43.71ms
+step:273000/500000 train_loss:1.9292 train_time:11933857ms step_avg:43.71ms
+step:273200/500000 train_loss:2.2138 train_time:11942593ms step_avg:43.71ms
+step:273400/500000 train_loss:2.0001 train_time:11951335ms step_avg:43.71ms
+step:273600/500000 train_loss:2.0742 train_time:11960070ms step_avg:43.71ms
+step:273800/500000 train_loss:2.0358 train_time:11968805ms step_avg:43.71ms
+step:274000/500000 train_loss:1.9466 train_time:11977538ms step_avg:43.71ms
+step:274200/500000 train_loss:2.1008 train_time:11986276ms step_avg:43.71ms
+step:274400/500000 train_loss:1.9161 train_time:11995003ms step_avg:43.71ms
+step:274600/500000 train_loss:1.9558 train_time:12003738ms step_avg:43.71ms
+step:274800/500000 train_loss:2.2432 train_time:12012473ms step_avg:43.71ms
+step:275000/500000 train_loss:2.0272 train_time:12021207ms step_avg:43.71ms
+step:275200/500000 train_loss:2.0214 train_time:12029946ms step_avg:43.71ms
+step:275400/500000 train_loss:1.9085 train_time:12038674ms step_avg:43.71ms
+step:275600/500000 train_loss:2.0615 train_time:12047412ms step_avg:43.71ms
+step:275800/500000 train_loss:2.1550 train_time:12056142ms step_avg:43.71ms
+step:276000/500000 train_loss:2.1168 train_time:12064879ms step_avg:43.71ms
+step:276200/500000 train_loss:1.9004 train_time:12073619ms step_avg:43.71ms
+step:276400/500000 train_loss:1.8931 train_time:12082474ms step_avg:43.71ms
+step:276600/500000 train_loss:2.0315 train_time:12091206ms step_avg:43.71ms
+step:276800/500000 train_loss:2.0176 train_time:12099933ms step_avg:43.71ms
+step:277000/500000 train_loss:2.0328 train_time:12108668ms step_avg:43.71ms
+step:277200/500000 train_loss:1.9919 train_time:12117401ms step_avg:43.71ms
+step:277400/500000 train_loss:2.0349 train_time:12126135ms step_avg:43.71ms
+step:277600/500000 train_loss:1.9732 train_time:12134862ms step_avg:43.71ms
+step:277800/500000 train_loss:2.0866 train_time:12143592ms step_avg:43.71ms
+step:278000/500000 train_loss:1.9639 train_time:12152324ms step_avg:43.71ms
+step:278200/500000 train_loss:1.8576 train_time:12161051ms step_avg:43.71ms
+step:278400/500000 train_loss:1.9091 train_time:12169780ms step_avg:43.71ms
+step:278600/500000 train_loss:1.9311 train_time:12178513ms step_avg:43.71ms
+step:278800/500000 train_loss:2.0320 train_time:12187246ms step_avg:43.71ms
+step:279000/500000 train_loss:2.2007 train_time:12195978ms step_avg:43.71ms
+step:279200/500000 train_loss:1.9399 train_time:12204704ms step_avg:43.71ms
+step:279400/500000 train_loss:2.1437 train_time:12213438ms step_avg:43.71ms
+step:279600/500000 train_loss:1.8597 train_time:12222170ms step_avg:43.71ms
+step:279800/500000 train_loss:1.8687 train_time:12230902ms step_avg:43.71ms
+step:280000/500000 train_loss:1.9473 train_time:12239631ms step_avg:43.71ms
+step:280000/500000 val_loss:2.0059 val_bpb:1.1880 train_time:12239647ms step_avg:43.71ms
+step:280200/500000 train_loss:1.8587 train_time:12248371ms step_avg:43.71ms
+step:280400/500000 train_loss:1.9275 train_time:12257226ms step_avg:43.71ms
+step:280600/500000 train_loss:2.0865 train_time:12265963ms step_avg:43.71ms
+step:280800/500000 train_loss:2.0359 train_time:12274697ms step_avg:43.71ms
+step:281000/500000 train_loss:2.0256 train_time:12283428ms step_avg:43.71ms
+step:281200/500000 train_loss:2.0721 train_time:12292168ms step_avg:43.71ms
+step:281400/500000 train_loss:1.9527 train_time:12300901ms step_avg:43.71ms
+step:281600/500000 train_loss:2.0435 train_time:12309642ms step_avg:43.71ms
+step:281800/500000 train_loss:1.9409 train_time:12318374ms step_avg:43.71ms
+step:282000/500000 train_loss:1.8839 train_time:12327110ms step_avg:43.71ms
+step:282200/500000 train_loss:1.9873 train_time:12335849ms step_avg:43.71ms
+step:282400/500000 train_loss:2.0421 train_time:12344583ms step_avg:43.71ms
+step:282600/500000 train_loss:2.1513 train_time:12353319ms step_avg:43.71ms
+step:282800/500000 train_loss:1.9599 train_time:12362052ms step_avg:43.71ms
+step:283000/500000 train_loss:2.1058 train_time:12370792ms step_avg:43.71ms
+step:283200/500000 train_loss:2.0484 train_time:12379527ms step_avg:43.71ms
+step:283400/500000 train_loss:2.0516 train_time:12388269ms step_avg:43.71ms
+step:283600/500000 train_loss:1.9050 train_time:12397005ms step_avg:43.71ms
+step:283800/500000 train_loss:1.9629 train_time:12405746ms step_avg:43.71ms
+step:284000/500000 train_loss:1.9291 train_time:12414495ms step_avg:43.71ms
+step:284200/500000 train_loss:2.0472 train_time:12423237ms step_avg:43.71ms
+step:284400/500000 train_loss:2.0072 train_time:12431976ms step_avg:43.71ms
+step:284600/500000 train_loss:1.9916 train_time:12440841ms step_avg:43.71ms
+step:284800/500000 train_loss:1.9631 train_time:12449579ms step_avg:43.71ms
+step:285000/500000 train_loss:2.0340 train_time:12458321ms step_avg:43.71ms
+step:285200/500000 train_loss:1.9090 train_time:12467053ms step_avg:43.71ms
+step:285400/500000 train_loss:2.1183 train_time:12475792ms step_avg:43.71ms
+step:285600/500000 train_loss:2.0143 train_time:12484524ms step_avg:43.71ms
+step:285800/500000 train_loss:2.0114 train_time:12493265ms step_avg:43.71ms
+step:286000/500000 train_loss:2.1389 train_time:12502004ms step_avg:43.71ms
+step:286200/500000 train_loss:1.9397 train_time:12510739ms step_avg:43.71ms
+step:286400/500000 train_loss:1.9368 train_time:12519480ms step_avg:43.71ms
+step:286600/500000 train_loss:2.0340 train_time:12528218ms step_avg:43.71ms
+step:286800/500000 train_loss:2.0173 train_time:12536952ms step_avg:43.71ms
+step:287000/500000 train_loss:2.0685 train_time:12545692ms step_avg:43.71ms
+step:287200/500000 train_loss:2.0053 train_time:12554435ms step_avg:43.71ms
+step:287400/500000 train_loss:1.9727 train_time:12563172ms step_avg:43.71ms
+step:287600/500000 train_loss:1.9363 train_time:12571910ms step_avg:43.71ms
+step:287800/500000 train_loss:2.4742 train_time:12580644ms step_avg:43.71ms
+step:288000/500000 train_loss:2.0207 train_time:12589381ms step_avg:43.71ms
+step:288200/500000 train_loss:1.9881 train_time:12598121ms step_avg:43.71ms
+step:288400/500000 train_loss:2.0589 train_time:12606861ms step_avg:43.71ms
+step:288600/500000 train_loss:1.9856 train_time:12615613ms step_avg:43.71ms
+step:288800/500000 train_loss:1.9577 train_time:12624470ms step_avg:43.71ms
+step:289000/500000 train_loss:2.0183 train_time:12633200ms step_avg:43.71ms
+step:289200/500000 train_loss:1.8817 train_time:12641936ms step_avg:43.71ms
+step:289400/500000 train_loss:2.1841 train_time:12650673ms step_avg:43.71ms
+step:289600/500000 train_loss:2.0147 train_time:12659407ms step_avg:43.71ms
+step:289800/500000 train_loss:2.0655 train_time:12668147ms step_avg:43.71ms
+step:290000/500000 train_loss:2.1097 train_time:12676876ms step_avg:43.71ms
+step:290200/500000 train_loss:1.9592 train_time:12685620ms step_avg:43.71ms
+step:290400/500000 train_loss:2.0262 train_time:12694358ms step_avg:43.71ms
+step:290600/500000 train_loss:1.9529 train_time:12703095ms step_avg:43.71ms
+step:290800/500000 train_loss:2.0480 train_time:12711840ms step_avg:43.71ms
+step:291000/500000 train_loss:1.8834 train_time:12720580ms step_avg:43.71ms
+step:291200/500000 train_loss:2.3031 train_time:12729321ms step_avg:43.71ms
+step:291400/500000 train_loss:1.9826 train_time:12738057ms step_avg:43.71ms
+step:291600/500000 train_loss:2.1083 train_time:12746796ms step_avg:43.71ms
+step:291800/500000 train_loss:2.0582 train_time:12755531ms step_avg:43.71ms
+step:292000/500000 train_loss:1.9632 train_time:12764274ms step_avg:43.71ms
+step:292200/500000 train_loss:2.3024 train_time:12773015ms step_avg:43.71ms
+step:292400/500000 train_loss:2.0066 train_time:12781752ms step_avg:43.71ms
+step:292600/500000 train_loss:1.9131 train_time:12790492ms step_avg:43.71ms
+step:292800/500000 train_loss:1.7997 train_time:12799355ms step_avg:43.71ms
+step:293000/500000 train_loss:1.9488 train_time:12808094ms step_avg:43.71ms
+step:293200/500000 train_loss:1.9176 train_time:12816830ms step_avg:43.71ms
+step:293400/500000 train_loss:1.9941 train_time:12825563ms step_avg:43.71ms
+step:293600/500000 train_loss:1.9393 train_time:12834298ms step_avg:43.71ms
+step:293800/500000 train_loss:2.0138 train_time:12843033ms step_avg:43.71ms
+step:294000/500000 train_loss:1.9956 train_time:12851779ms step_avg:43.71ms
+step:294200/500000 train_loss:2.2086 train_time:12860523ms step_avg:43.71ms
+step:294400/500000 train_loss:2.0443 train_time:12869263ms step_avg:43.71ms
+step:294600/500000 train_loss:2.0434 train_time:12878001ms step_avg:43.71ms
+step:294800/500000 train_loss:1.9583 train_time:12886742ms step_avg:43.71ms
+step:295000/500000 train_loss:2.0998 train_time:12895484ms step_avg:43.71ms
+step:295200/500000 train_loss:2.0722 train_time:12904220ms step_avg:43.71ms
+step:295400/500000 train_loss:1.9461 train_time:12912961ms step_avg:43.71ms
+step:295600/500000 train_loss:2.1570 train_time:12921702ms step_avg:43.71ms
+step:295800/500000 train_loss:2.0004 train_time:12930439ms step_avg:43.71ms
+step:296000/500000 train_loss:2.0067 train_time:12939179ms step_avg:43.71ms
+step:296200/500000 train_loss:2.0775 train_time:12947917ms step_avg:43.71ms
+step:296400/500000 train_loss:1.9923 train_time:12956652ms step_avg:43.71ms
+step:296600/500000 train_loss:1.9149 train_time:12965382ms step_avg:43.71ms
+step:296800/500000 train_loss:2.0540 train_time:12974118ms step_avg:43.71ms
+step:297000/500000 train_loss:2.0463 train_time:12982942ms step_avg:43.71ms
+step:297200/500000 train_loss:2.0057 train_time:12991673ms step_avg:43.71ms
+step:297400/500000 train_loss:2.0463 train_time:13000411ms step_avg:43.71ms
+step:297600/500000 train_loss:1.9426 train_time:13009145ms step_avg:43.71ms
+step:297800/500000 train_loss:2.1075 train_time:13017877ms step_avg:43.71ms
+step:298000/500000 train_loss:2.0239 train_time:13026611ms step_avg:43.71ms
+step:298200/500000 train_loss:2.0610 train_time:13035344ms step_avg:43.71ms
+step:298400/500000 train_loss:1.9703 train_time:13044080ms step_avg:43.71ms
+step:298600/500000 train_loss:2.0391 train_time:13052807ms step_avg:43.71ms
+step:298800/500000 train_loss:2.0636 train_time:13061543ms step_avg:43.71ms
+step:299000/500000 train_loss:2.0894 train_time:13070275ms step_avg:43.71ms
+step:299200/500000 train_loss:2.0629 train_time:13079003ms step_avg:43.71ms
+step:299400/500000 train_loss:1.9541 train_time:13087740ms step_avg:43.71ms
+step:299600/500000 train_loss:1.9493 train_time:13096471ms step_avg:43.71ms
+step:299800/500000 train_loss:1.9702 train_time:13105202ms step_avg:43.71ms
+step:300000/500000 train_loss:2.1034 train_time:13114057ms step_avg:43.71ms
+step:300000/500000 val_loss:2.0019 val_bpb:1.1856 train_time:13114073ms step_avg:43.71ms
+step:300200/500000 train_loss:2.0228 train_time:13122789ms step_avg:43.71ms
+step:300400/500000 train_loss:1.9918 train_time:13131517ms step_avg:43.71ms
+step:300600/500000 train_loss:1.9048 train_time:13140246ms step_avg:43.71ms
+step:300800/500000 train_loss:1.9942 train_time:13148989ms step_avg:43.71ms
+step:301000/500000 train_loss:2.1526 train_time:13157717ms step_avg:43.71ms
+step:301200/500000 train_loss:1.9766 train_time:13166455ms step_avg:43.71ms
+step:301400/500000 train_loss:2.0515 train_time:13175188ms step_avg:43.71ms
+step:301600/500000 train_loss:1.8824 train_time:13183920ms step_avg:43.71ms
+step:301800/500000 train_loss:1.8421 train_time:13192653ms step_avg:43.71ms
+step:302000/500000 train_loss:2.0031 train_time:13201380ms step_avg:43.71ms
+step:302200/500000 train_loss:1.9957 train_time:13210119ms step_avg:43.71ms
+step:302400/500000 train_loss:2.1431 train_time:13218856ms step_avg:43.71ms
+step:302600/500000 train_loss:2.0113 train_time:13227593ms step_avg:43.71ms
+step:302800/500000 train_loss:2.0139 train_time:13236331ms step_avg:43.71ms
+step:303000/500000 train_loss:1.9975 train_time:13245066ms step_avg:43.71ms
+step:303200/500000 train_loss:2.1686 train_time:13253809ms step_avg:43.71ms
+step:303400/500000 train_loss:1.9958 train_time:13262553ms step_avg:43.71ms
+step:303600/500000 train_loss:2.0634 train_time:13271294ms step_avg:43.71ms
+step:303800/500000 train_loss:1.8341 train_time:13280026ms step_avg:43.71ms
+step:304000/500000 train_loss:2.0196 train_time:13288891ms step_avg:43.71ms
+step:304200/500000 train_loss:2.0038 train_time:13297621ms step_avg:43.71ms
+step:304400/500000 train_loss:2.1533 train_time:13306354ms step_avg:43.71ms
+step:304600/500000 train_loss:1.9385 train_time:13315087ms step_avg:43.71ms
+step:304800/500000 train_loss:1.9258 train_time:13323823ms step_avg:43.71ms
+step:305000/500000 train_loss:2.0313 train_time:13332561ms step_avg:43.71ms
+step:305200/500000 train_loss:2.0963 train_time:13341296ms step_avg:43.71ms
+step:305400/500000 train_loss:1.9291 train_time:13350029ms step_avg:43.71ms
+step:305600/500000 train_loss:2.1894 train_time:13358764ms step_avg:43.71ms
+step:305800/500000 train_loss:1.9851 train_time:13367498ms step_avg:43.71ms
+step:306000/500000 train_loss:2.0476 train_time:13376229ms step_avg:43.71ms
+step:306200/500000 train_loss:1.9089 train_time:13384965ms step_avg:43.71ms
+step:306400/500000 train_loss:1.9578 train_time:13393701ms step_avg:43.71ms
+step:306600/500000 train_loss:1.9322 train_time:13402431ms step_avg:43.71ms
+step:306800/500000 train_loss:1.9681 train_time:13411166ms step_avg:43.71ms
+step:307000/500000 train_loss:1.9248 train_time:13419899ms step_avg:43.71ms
+step:307200/500000 train_loss:2.0327 train_time:13428633ms step_avg:43.71ms
+step:307400/500000 train_loss:2.1124 train_time:13437369ms step_avg:43.71ms
+step:307600/500000 train_loss:1.8934 train_time:13446105ms step_avg:43.71ms
+step:307800/500000 train_loss:1.9565 train_time:13454841ms step_avg:43.71ms
+step:308000/500000 train_loss:1.9663 train_time:13463575ms step_avg:43.71ms
+step:308200/500000 train_loss:2.0848 train_time:13472431ms step_avg:43.71ms
+step:308400/500000 train_loss:1.9285 train_time:13481167ms step_avg:43.71ms
+step:308600/500000 train_loss:2.0670 train_time:13489905ms step_avg:43.71ms
+step:308800/500000 train_loss:1.8746 train_time:13498638ms step_avg:43.71ms
+step:309000/500000 train_loss:1.9112 train_time:13507371ms step_avg:43.71ms
+step:309200/500000 train_loss:1.9494 train_time:13516103ms step_avg:43.71ms
+step:309400/500000 train_loss:1.9585 train_time:13524834ms step_avg:43.71ms
+step:309600/500000 train_loss:2.0978 train_time:13533563ms step_avg:43.71ms
+step:309800/500000 train_loss:1.7424 train_time:13542293ms step_avg:43.71ms
+step:310000/500000 train_loss:2.0507 train_time:13551024ms step_avg:43.71ms
+step:310200/500000 train_loss:1.9832 train_time:13559759ms step_avg:43.71ms
+step:310400/500000 train_loss:2.0606 train_time:13568497ms step_avg:43.71ms
+step:310600/500000 train_loss:1.9982 train_time:13577225ms step_avg:43.71ms
+step:310800/500000 train_loss:1.9846 train_time:13585959ms step_avg:43.71ms
+step:311000/500000 train_loss:2.1402 train_time:13594684ms step_avg:43.71ms
+step:311200/500000 train_loss:1.9878 train_time:13603418ms step_avg:43.71ms
+step:311400/500000 train_loss:2.0720 train_time:13612147ms step_avg:43.71ms
+step:311600/500000 train_loss:1.9734 train_time:13620878ms step_avg:43.71ms
+step:311800/500000 train_loss:2.0040 train_time:13629609ms step_avg:43.71ms
+step:312000/500000 train_loss:1.9738 train_time:13638336ms step_avg:43.71ms
+step:312200/500000 train_loss:2.0241 train_time:13647067ms step_avg:43.71ms
+step:312400/500000 train_loss:1.9978 train_time:13655930ms step_avg:43.71ms
+step:312600/500000 train_loss:2.0128 train_time:13664662ms step_avg:43.71ms
+step:312800/500000 train_loss:2.0693 train_time:13673392ms step_avg:43.71ms
+step:313000/500000 train_loss:2.0319 train_time:13682118ms step_avg:43.71ms
+step:313200/500000 train_loss:1.9284 train_time:13690852ms step_avg:43.71ms
+step:313400/500000 train_loss:1.8951 train_time:13699586ms step_avg:43.71ms
+step:313600/500000 train_loss:2.0394 train_time:13708322ms step_avg:43.71ms
+step:313800/500000 train_loss:2.0955 train_time:13717054ms step_avg:43.71ms
+step:314000/500000 train_loss:1.9987 train_time:13725790ms step_avg:43.71ms
+step:314200/500000 train_loss:2.0925 train_time:13734528ms step_avg:43.71ms
+step:314400/500000 train_loss:1.9578 train_time:13743269ms step_avg:43.71ms
+step:314600/500000 train_loss:2.0680 train_time:13752011ms step_avg:43.71ms
+step:314800/500000 train_loss:2.0073 train_time:13760745ms step_avg:43.71ms
+step:315000/500000 train_loss:2.0866 train_time:13769479ms step_avg:43.71ms
+step:315200/500000 train_loss:1.9914 train_time:13778210ms step_avg:43.71ms
+step:315400/500000 train_loss:1.9911 train_time:13786945ms step_avg:43.71ms
+step:315600/500000 train_loss:2.1094 train_time:13795674ms step_avg:43.71ms
+step:315800/500000 train_loss:1.9628 train_time:13804407ms step_avg:43.71ms
+step:316000/500000 train_loss:1.9922 train_time:13813140ms step_avg:43.71ms
+step:316200/500000 train_loss:1.9326 train_time:13821872ms step_avg:43.71ms
+step:316400/500000 train_loss:1.9757 train_time:13830721ms step_avg:43.71ms
+step:316600/500000 train_loss:1.9799 train_time:13839462ms step_avg:43.71ms
+step:316800/500000 train_loss:2.0344 train_time:13848182ms step_avg:43.71ms
+step:317000/500000 train_loss:1.9756 train_time:13856912ms step_avg:43.71ms
+step:317200/500000 train_loss:1.8813 train_time:13865639ms step_avg:43.71ms
+step:317400/500000 train_loss:2.0688 train_time:13874365ms step_avg:43.71ms
+step:317600/500000 train_loss:2.1282 train_time:13883090ms step_avg:43.71ms
+step:317800/500000 train_loss:2.0416 train_time:13891818ms step_avg:43.71ms
+step:318000/500000 train_loss:1.9459 train_time:13900547ms step_avg:43.71ms
+step:318200/500000 train_loss:2.1129 train_time:13909275ms step_avg:43.71ms
+step:318400/500000 train_loss:1.9522 train_time:13918006ms step_avg:43.71ms
+step:318600/500000 train_loss:2.0662 train_time:13926738ms step_avg:43.71ms
+step:318800/500000 train_loss:2.0459 train_time:13935471ms step_avg:43.71ms
+step:319000/500000 train_loss:2.0817 train_time:13944202ms step_avg:43.71ms
+step:319200/500000 train_loss:2.0718 train_time:13952933ms step_avg:43.71ms
+step:319400/500000 train_loss:2.0248 train_time:13961668ms step_avg:43.71ms
+step:319600/500000 train_loss:2.0153 train_time:13970395ms step_avg:43.71ms
+step:319800/500000 train_loss:1.9507 train_time:13979128ms step_avg:43.71ms
+step:320000/500000 train_loss:1.9200 train_time:13987858ms step_avg:43.71ms
+step:320000/500000 val_loss:2.0010 val_bpb:1.1851 train_time:13987873ms step_avg:43.71ms
+step:320200/500000 train_loss:2.0915 train_time:13996588ms step_avg:43.71ms
+step:320400/500000 train_loss:1.9669 train_time:14005321ms step_avg:43.71ms
+step:320600/500000 train_loss:2.0496 train_time:14014180ms step_avg:43.71ms
+step:320800/500000 train_loss:2.1033 train_time:14022909ms step_avg:43.71ms
+step:321000/500000 train_loss:1.8803 train_time:14031642ms step_avg:43.71ms
+step:321200/500000 train_loss:1.9655 train_time:14040378ms step_avg:43.71ms
+step:321400/500000 train_loss:2.0748 train_time:14049121ms step_avg:43.71ms
+step:321600/500000 train_loss:2.0919 train_time:14057851ms step_avg:43.71ms
+step:321800/500000 train_loss:2.0200 train_time:14066585ms step_avg:43.71ms
+step:322000/500000 train_loss:2.0471 train_time:14075312ms step_avg:43.71ms
+step:322200/500000 train_loss:1.9971 train_time:14084041ms step_avg:43.71ms
+step:322400/500000 train_loss:1.8674 train_time:14092769ms step_avg:43.71ms
+step:322600/500000 train_loss:1.9619 train_time:14101496ms step_avg:43.71ms
+step:322800/500000 train_loss:2.0699 train_time:14110228ms step_avg:43.71ms
+step:323000/500000 train_loss:2.4914 train_time:14118957ms step_avg:43.71ms
+step:323200/500000 train_loss:1.9907 train_time:14127689ms step_avg:43.71ms
+step:323400/500000 train_loss:2.0156 train_time:14136416ms step_avg:43.71ms
+step:323600/500000 train_loss:2.0626 train_time:14145143ms step_avg:43.71ms
+step:323800/500000 train_loss:1.9674 train_time:14153871ms step_avg:43.71ms
+step:324000/500000 train_loss:2.0524 train_time:14162599ms step_avg:43.71ms
+step:324200/500000 train_loss:2.0582 train_time:14171336ms step_avg:43.71ms
+step:324400/500000 train_loss:1.8915 train_time:14180068ms step_avg:43.71ms
+step:324600/500000 train_loss:2.0369 train_time:14188927ms step_avg:43.71ms
+step:324800/500000 train_loss:2.0923 train_time:14197654ms step_avg:43.71ms
+step:325000/500000 train_loss:1.9624 train_time:14206393ms step_avg:43.71ms
+step:325200/500000 train_loss:2.0010 train_time:14215134ms step_avg:43.71ms
+step:325400/500000 train_loss:2.0460 train_time:14223861ms step_avg:43.71ms
+step:325600/500000 train_loss:2.0737 train_time:14232589ms step_avg:43.71ms
+step:325800/500000 train_loss:1.8715 train_time:14241323ms step_avg:43.71ms
+step:326000/500000 train_loss:1.9732 train_time:14250055ms step_avg:43.71ms
+step:326200/500000 train_loss:1.8768 train_time:14258799ms step_avg:43.71ms
+step:326400/500000 train_loss:1.9460 train_time:14267540ms step_avg:43.71ms
+step:326600/500000 train_loss:1.9567 train_time:14276274ms step_avg:43.71ms
+step:326800/500000 train_loss:1.9932 train_time:14285014ms step_avg:43.71ms
+step:327000/500000 train_loss:2.0671 train_time:14293752ms step_avg:43.71ms
+step:327200/500000 train_loss:1.9528 train_time:14302485ms step_avg:43.71ms
+step:327400/500000 train_loss:2.0451 train_time:14311217ms step_avg:43.71ms
+step:327600/500000 train_loss:2.0360 train_time:14319952ms step_avg:43.71ms
+step:327800/500000 train_loss:2.0273 train_time:14328692ms step_avg:43.71ms
+step:328000/500000 train_loss:2.1318 train_time:14337428ms step_avg:43.71ms
+step:328200/500000 train_loss:2.0219 train_time:14346167ms step_avg:43.71ms
+step:328400/500000 train_loss:1.8702 train_time:14354904ms step_avg:43.71ms
+step:328600/500000 train_loss:2.2112 train_time:14363648ms step_avg:43.71ms
+step:328800/500000 train_loss:2.0007 train_time:14372507ms step_avg:43.71ms
+step:329000/500000 train_loss:1.9333 train_time:14381249ms step_avg:43.71ms
+step:329200/500000 train_loss:1.8833 train_time:14389989ms step_avg:43.71ms
+step:329400/500000 train_loss:1.9639 train_time:14398731ms step_avg:43.71ms
+step:329430/500000 val_loss:1.9837 val_bpb:1.1749 train_time:14400039ms step_avg:43.71ms
+stopping_early: wallclock_cap train_time:14400039ms step:329430/500000
+peak memory allocated: 10184 MiB reserved: 10588 MiB
+Serialized model: 67224983 bytes
+Code size: 47642 bytes
+Total submission size: 67272625 bytes
+Serialized model int8+zlib: 15762519 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)
+Total submission size int8+zlib: 15810161 bytes
+final_int8_zlib_roundtrip val_loss:2.0386 val_bpb:1.2074 eval_time:1356ms
+final_int8_zlib_roundtrip_exact val_loss:2.03860961 val_bpb:1.20737944
diff --git a/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/train_gpt.py b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/train_gpt.py
new file mode 100644
index 0000000..0deb056
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/train_gpt.py
@@ -0,0 +1,1126 @@
+"""
+The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
+
+Hard stop: `train_gpt.py` and `train_gpt_mlx.py` must never be longer than 1500 lines.
+"""
+
+from __future__ import annotations
+
+import copy
+import glob
+import io
+import math
+import os
+import random
+import subprocess
+import sys
+import time
+import uuid
+import zlib
+from pathlib import Path
+
+import numpy as np
+import sentencepiece as spm
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# -----------------------------
+# HYPERPARAMETERS
+# -----------------------------
+# Default Simple Baseline run:
+# - 9 transformer blocks at width 512
+# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
+# - vocab size 1024, sequence length 1024, tied embeddings
+# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+
+class Hyperparameters:
+    # Data paths are shard globs produced by the existing preprocessing pipeline.
+    data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+    train_files = os.path.join(data_path, "fineweb_train_*.bin")
+    val_files = os.path.join(data_path, "fineweb_val_*.bin")
+    tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+    run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
+    seed = int(os.environ.get("SEED", 1337))
+
+    # Validation cadence and batch size. Validation always uses the full fineweb_val split.
+    val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288))
+    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 1000))
+    train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 200))
+
+    # Training length.
+    iterations = int(os.environ.get("ITERATIONS", 20000))
+    warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 1200))
+    warmup_steps = int(os.environ.get("WARMUP_STEPS", 20))
+    train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 524_288))
+    train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 1024))
+    max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0))
+    qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))
+
+    # Model shape.
+    vocab_size = int(os.environ.get("VOCAB_SIZE", 1024))
+    num_layers = int(os.environ.get("NUM_LAYERS", 9))
+    num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4))
+    model_dim = int(os.environ.get("MODEL_DIM", 512))
+    num_heads = int(os.environ.get("NUM_HEADS", 8))
+    mlp_mult = int(os.environ.get("MLP_MULT", 2))
+    tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1")))
+    rope_base = float(os.environ.get("ROPE_BASE", 10000.0))
+    logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0))
+
+    # Optimizer hyperparameters.
+    embed_lr = float(os.environ.get("EMBED_LR", 0.6))
+    head_lr = float(os.environ.get("HEAD_LR", 0.008))
+    tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.05))
+    tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005))
+    matrix_lr = float(os.environ.get("MATRIX_LR", 0.04))
+    scalar_lr = float(os.environ.get("SCALAR_LR", 0.04))
+    muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95))
+    muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5))
+    muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
+    muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))
+    beta1 = float(os.environ.get("BETA1", 0.9))
+    beta2 = float(os.environ.get("BETA2", 0.95))
+    adam_eps = float(os.environ.get("ADAM_EPS", 1e-8))
+    grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.0))
+
+# -----------------------------
+# MUON OPTIMIZER 
+# -----------------------------
+# 
+# As borrowed from modded-nanogpt
+# Background on Muon: https://kellerjordan.github.io/posts/muon/
+
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+    # Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration.
+    # Muon uses this to normalize matrix-shaped gradients before applying them.
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    X = G.bfloat16()
+    X /= X.norm() + eps
+    transposed = G.size(0) > G.size(1)
+    if transposed:
+        X = X.T
+    for _ in range(steps):
+        A = X @ X.T
+        B = b * A + c * A @ A
+        X = a * X + B @ X
+    return X.T if transposed else X
+
+
+class Muon(torch.optim.Optimizer):
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True):
+        super().__init__(
+            params,
+            dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov),
+        )
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        distributed = dist.is_available() and dist.is_initialized()
+        world_size = dist.get_world_size() if distributed else 1
+        rank = dist.get_rank() if distributed else 0
+
+        for group in self.param_groups:
+            params = group["params"]
+            if not params:
+                continue
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+
+            total_params = sum(int(p.numel()) for p in params)
+            updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+
+            curr = 0
+            for i, p in enumerate(params):
+                if i % world_size == rank and p.grad is not None:
+                    g = p.grad
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                    buf.mul_(momentum).add_(g)
+                    if nesterov:
+                        g = g.add(buf, alpha=momentum)
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+                    # Scale correction from Muon reference implementations.
+                    g *= max(1, g.size(0) / g.size(1)) ** 0.5
+                    updates_flat[curr : curr + p.numel()] = g.reshape(-1)
+                curr += p.numel()
+
+            if distributed:
+                dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+
+            curr = 0
+            for p in params:
+                g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+                p.add_(g, alpha=-lr)
+                curr += p.numel()
+
+        return loss
+
+
+# -----------------------------
+# TOKENIZER-AGNOSTIC EVALUATION SETUP 
+# -----------------------------
+#
+# It's common for small models have a large fraction of their parameters be embeddings, since the 2 * d_model * d_vocab vectors can be gigantic.
+# Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set.
+# We calculate BPB (bits-per-byte) instead of validation loss, so we need methods to count the number of bits per token in the tokenizer.
+# Note: Submissions that edit the tokenizer will be examined more carefully, since screwing this up might unjustly improve your score.
+
+def build_sentencepiece_luts(
+    sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device
+) -> tuple[Tensor, Tensor, Tensor]:
+    sp_vocab_size = int(sp.vocab_size())
+    table_size = max(sp_vocab_size, vocab_size)
+    base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+    has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+    is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+    for token_id in range(sp_vocab_size):
+        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+            continue
+        is_boundary_token_np[token_id] = False
+        if sp.is_byte(token_id):
+            base_bytes_np[token_id] = 1
+            continue
+        piece = sp.id_to_piece(token_id)
+        if piece.startswith("▁"):
+            has_leading_space_np[token_id] = True
+            piece = piece[1:]
+        base_bytes_np[token_id] = len(piece.encode("utf-8"))
+    return (
+        torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+        torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+        torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+    )
+
+
+def load_validation_tokens(pattern: str, seq_len: int) -> Tensor:
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*.
+    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
+    usable = ((tokens.numel() - 1) // seq_len) * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def eval_val(
+    args: Hyperparameters,
+    model: nn.Module,
+    rank: int,
+    world_size: int,
+    device: torch.device,
+    grad_accum_steps: int,
+    val_tokens: Tensor,
+    base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor,
+    is_boundary_token_lut: Tensor,
+) -> tuple[float, float]:
+    # Validation computes two metrics:
+    # - val_loss: token cross-entropy (natural log)
+    # - val_bpb: tokenizer-agnostic compression metric used by the challenge
+    local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+    if local_batch_tokens < args.train_seq_len:
+        raise ValueError(
+            "VAL_BATCH_SIZE must provide at least one sequence per rank; "
+            f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, "
+            f"GRAD_ACCUM_STEPS={grad_accum_steps}, TRAIN_SEQ_LEN={args.train_seq_len}"
+        )
+    local_batch_seqs = local_batch_tokens // args.train_seq_len
+    total_seqs = (val_tokens.numel() - 1) // args.train_seq_len
+    seq_start = (total_seqs * rank) // world_size
+    seq_end = (total_seqs * (rank + 1)) // world_size
+    val_loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    val_token_count = torch.zeros((), device=device, dtype=torch.float64)
+    val_byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    model.eval()
+    with torch.inference_mode():
+        for batch_seq_start in range(seq_start, seq_end, local_batch_seqs):
+            batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end)
+            raw_start = batch_seq_start * args.train_seq_len
+            raw_end = batch_seq_end * args.train_seq_len + 1
+            local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True)
+            x = local[:-1].reshape(-1, args.train_seq_len)
+            y = local[1:].reshape(-1, args.train_seq_len)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                batch_loss = model(x, y).detach()
+            batch_token_count = float(y.numel())
+            val_loss_sum += batch_loss.to(torch.float64) * batch_token_count
+            val_token_count += batch_token_count
+            prev_ids = x.reshape(-1)
+            tgt_ids = y.reshape(-1)
+            token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)
+            token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)
+            val_byte_count += token_bytes.to(torch.float64).sum()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM)
+
+    val_loss = val_loss_sum / val_token_count
+    bits_per_token = val_loss.item() / math.log(2.0)
+    tokens_per_byte = val_token_count.item() / val_byte_count.item()
+    model.train()
+    return float(val_loss.item()), float(bits_per_token * tokens_per_byte)
+
+# -----------------------------
+# POST-TRAINING QUANTIZATION
+# -----------------------------
+#
+# It's silly to export our model, which is trained in bf16 and fp32, at that same precision.
+# Instead, we get approximately the same model (with a small hit) by quantizing the model to int8 & zlib compressing.
+# We can then decompress the model and run in higher precision for evaluation, after closing in under the size limit.
+
+CONTROL_TENSOR_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "CONTROL_TENSOR_NAME_PATTERNS",
+        "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights",
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
+        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
+INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
+INT8_PER_ROW_SCALE_DTYPE = torch.float16
+INT8_CLIP_PERCENTILE = 99.99984
+INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
+
+def tensor_nbytes(t: Tensor) -> int:
+    return int(t.numel()) * int(t.element_size())
+
+def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor:
+    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
+        return t.float().contiguous()
+    if t.dtype in {torch.float32, torch.bfloat16}:
+        passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
+        return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous()
+    return t
+
+def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]:
+    t32 = t.float()
+    if t32.ndim == 2:
+        # Matrices get one scale per row, which usually tracks output-channel
+        # ranges much better than a single tensor-wide scale.
+        clip_abs = (
+            torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1)
+            if t32.numel()
+            else torch.empty((t32.shape[0],), dtype=torch.float32)
+        )
+        clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None])
+        scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0)
+        q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous()
+        return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous()
+
+    # Vectors / scalars use a simpler per-tensor scale.
+    clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0
+    scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32)
+    q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous()
+    return q, scale
+
+def quantize_state_dict_int8(state_dict: dict[str, Tensor]):
+    # Single supported clean-script export format:
+    # - per-row int8 for 2D float tensors
+    # - per-tensor int8 for other float tensors
+    # - exact passthrough for non-floats
+    # - passthrough for small float tensors, stored as fp16 to save bytes
+    quantized: dict[str, Tensor] = {}
+    scales: dict[str, Tensor] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, Tensor] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+
+    for name, tensor in state_dict.items():
+        t = tensor.detach().to("cpu").contiguous()
+        stats["param_count"] += int(t.numel())
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += tensor_nbytes(t)
+
+        if not t.is_floating_point():
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = t
+            stats["int8_payload_bytes"] += tensor_nbytes(t)
+            continue
+
+        # Small float tensors are cheap enough to keep directly. We still downcast
+        # fp32/bf16 passthrough tensors to fp16 so metadata does not dominate size.
+        if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_tensor(name, t, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += tensor_nbytes(kept)
+            continue
+
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_tensor(t)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(t.dtype).removeprefix("torch.")
+        stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s)
+
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+
+def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    qmeta = obj.get("qmeta", {})
+    passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {})
+    for name, q in obj["quantized"].items():
+        dtype = getattr(torch, obj["dtypes"][name])
+        s = obj["scales"][name]
+        if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0:
+            s = s.to(dtype=torch.float32)
+            # Broadcast the saved row scale back across trailing dimensions.
+            out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous()
+        else:
+            scale = float(s.item())
+            out[name] = (q.float() * scale).to(dtype=dtype).contiguous()
+    for name, t in obj["passthrough"].items():
+        # Restore small tensors, undoing the temporary fp16 storage cast if needed.
+        out_t = t.detach().to("cpu").contiguous()
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous()
+        out[name] = out_t
+    return out
+
+
+# -----------------------------
+# DATA LOADING 
+# -----------------------------
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    token_bytes = np.dtype("<u2").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    # SHARD HEADER INTS & SHARD_MAGIC
+    if header.size != 256 or int(header[0]) != 20240520 or int(header[1]) != 1:
+        raise ValueError(f"Unexpected shard header for {file}")
+    num_tokens = int(header[2])
+    expected_size = header_bytes + num_tokens * token_bytes
+    if file.stat().st_size != expected_size:
+        raise ValueError(f"Shard size mismatch for {file}: expected {expected_size} bytes")
+    tokens_np = np.fromfile(file, dtype="<u2", count=num_tokens, offset=header_bytes)
+    if tokens_np.size != num_tokens:
+        raise ValueError(f"Short read for {file}")
+    return torch.from_numpy(tokens_np.astype(np.uint16, copy=False))
+
+
+class TokenStream:
+    # Reads shards sequentially and wraps around forever. The training loop therefore
+    # has deterministic, simple streaming behavior with no sampling or workers.
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[0])
+        self.pos = 0
+
+    def _advance_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> Tensor:
+        chunks: list[Tensor] = []
+        remaining = n
+        while remaining > 0:
+            avail = self.tokens.numel() - self.pos
+            if avail <= 0:
+                self._advance_file()
+                continue
+            k = min(remaining, avail)
+            chunks.append(self.tokens[self.pos : self.pos + k])
+            self.pos += k
+            remaining -= k
+        return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+
+class DistributedTokenLoader:
+    # Each call consumes a contiguous chunk from the shared token stream, then slices out
+    # one disjoint span per rank. The extra "+1" token lets us build (x, y) by shifting.
+    def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+        self.rank = rank
+        self.world_size = world_size
+        self.device = device
+        self.stream = TokenStream(pattern)
+
+    def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
+        local_tokens = global_tokens // (self.world_size * grad_accum_steps)
+        per_rank_span = local_tokens + 1
+        chunk = self.stream.take(per_rank_span * self.world_size)
+        start = self.rank * per_rank_span
+        local = chunk[start : start + per_rank_span].to(dtype=torch.int64)
+        x = local[:-1].reshape(-1, seq_len)
+        y = local[1:].reshape(-1, seq_len)
+        return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
+
+# -----------------------------
+# TRANSFORMER MODULES
+# -----------------------------
+
+class RMSNorm(nn.Module):
+    def __init__(self, eps: float | None = None):
+        super().__init__()
+        self.eps = eps
+
+    def forward(self, x: Tensor) -> Tensor:
+        return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+
+class CastedLinear(nn.Linear):
+    # Keep weights in fp32 for optimizer/state quality, cast at matmul time for bf16 compute.
+    def forward(self, x: Tensor) -> Tensor:
+        bias = self.bias.to(x.dtype) if self.bias is not None else None
+        return F.linear(x, self.weight.to(x.dtype), bias)
+
+
+def restore_low_dim_params_to_fp32(module: nn.Module) -> None:
+    # Keep small/control parameters in fp32 even when the model body runs in bf16.
+    with torch.no_grad():
+        for name, param in module.named_parameters():
+            if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32:
+                param.data = param.data.float()
+
+
+class Rotary(nn.Module):
+    # Caches cos/sin tables per sequence length on the current device.
+    def __init__(self, dim: int, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached: Tensor | None = None
+        self._sin_cached: Tensor | None = None
+
+    def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]:
+        if (
+            self._cos_cached is None
+            or self._sin_cached is None
+            or self._seq_len_cached != seq_len
+            or self._cos_cached.device != device
+        ):
+            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+            freqs = torch.outer(t, self.inv_freq.to(device))
+            self._cos_cached = freqs.cos()[None, None, :, :]
+            self._sin_cached = freqs.sin()[None, None, :, :]
+            self._seq_len_cached = seq_len
+        return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
+
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+
+class CausalSelfAttention(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if dim % num_heads != 0:
+            raise ValueError("model_dim must be divisible by num_heads")
+        if num_heads % num_kv_heads != 0:
+            raise ValueError("num_heads must be divisible by num_kv_heads")
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = dim // num_heads
+        if self.head_dim % 2 != 0:
+            raise ValueError("head_dim must be even for RoPE")
+        kv_dim = self.num_kv_heads * self.head_dim
+        self.c_q = CastedLinear(dim, dim, bias=False)
+        self.c_k = CastedLinear(dim, kv_dim, bias=False)
+        self.c_v = CastedLinear(dim, kv_dim, bias=False)
+        self.proj = CastedLinear(dim, dim, bias=False)
+        self.proj._zero_init = True
+        self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+        self.rotary = Rotary(self.head_dim, base=rope_base)
+
+    def forward(self, x: Tensor) -> Tensor:
+        bsz, seqlen, dim = x.shape
+        q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
+        k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        q = F.rms_norm(q, (q.size(-1),))
+        k = F.rms_norm(k, (k.size(-1),))
+        cos, sin = self.rotary(seqlen, x.device, q.dtype)
+        q = apply_rotary_emb(q, cos, sin)
+        k = apply_rotary_emb(k, cos, sin)
+        q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None]
+        y = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=None,
+            is_causal=True,
+            enable_gqa=(self.num_kv_heads != self.num_heads),
+        )
+        y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim)
+        return self.proj(y)
+
+
+class MLP(nn.Module):
+    # relu^2 MLP from the original modded-nanogpt setup
+    def __init__(self, dim: int, mlp_mult: int):
+        super().__init__()
+        hidden = mlp_mult * dim
+        self.fc = CastedLinear(dim, hidden, bias=False)
+        self.proj = CastedLinear(hidden, dim, bias=False)
+        self.proj._zero_init = True
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = torch.relu(self.fc(x))
+        return self.proj(x.square())
+
+
+class Block(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init)
+        self.mlp = MLP(dim, mlp_mult)
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x))
+        x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out
+        x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x))
+        return x
+
+
+class GPT(nn.Module):
+    def __init__(
+        self,
+        vocab_size: int,
+        num_layers: int,
+        model_dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        tie_embeddings: bool,
+        tied_embed_init_std: float,
+        logit_softcap: float,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if logit_softcap <= 0.0:
+            raise ValueError(f"logit_softcap must be positive, got {logit_softcap}")
+        self.tie_embeddings = tie_embeddings
+        self.tied_embed_init_std = tied_embed_init_std
+        self.logit_softcap = logit_softcap
+        self.tok_emb = nn.Embedding(vocab_size, model_dim)
+        self.num_encoder_layers = num_layers // 2
+        self.num_decoder_layers = num_layers - self.num_encoder_layers
+        self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+        self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
+        self.blocks = nn.ModuleList(
+            [
+                Block(
+                    model_dim,
+                    num_heads,
+                    num_kv_heads,
+                    mlp_mult,
+                    rope_base,
+                    qk_gain_init,
+                )
+                for i in range(num_layers)
+            ]
+        )
+        self.final_norm = RMSNorm()
+        self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False)
+        if self.lm_head is not None:
+            self.lm_head._zero_init = True
+        self._init_weights()
+
+    def _init_weights(self) -> None:
+        if self.tie_embeddings:
+            nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std)
+        for module in self.modules():
+            if isinstance(module, nn.Linear) and getattr(module, "_zero_init", False):
+                nn.init.zeros_(module.weight)
+
+    def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor:
+        x = self.tok_emb(input_ids)
+        x = F.rms_norm(x, (x.size(-1),))
+        x0 = x
+        skips: list[Tensor] = []
+
+        # First half stores skips; second half reuses them in reverse order.
+        for i in range(self.num_encoder_layers):
+            x = self.blocks[i](x, x0)
+            skips.append(x)
+        for i in range(self.num_decoder_layers):
+            if skips:
+                x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop()
+            x = self.blocks[self.num_encoder_layers + i](x, x0)
+
+        x = self.final_norm(x).reshape(-1, x.size(-1))
+        targets = target_ids.reshape(-1)
+        if self.tie_embeddings:
+            logits_proj = F.linear(x, self.tok_emb.weight)
+        else:
+            if self.lm_head is None:
+                raise RuntimeError("lm_head is required when tie_embeddings=False")
+            logits_proj = self.lm_head(x)
+        logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)
+        return F.cross_entropy(logits.float(), targets, reduction="mean")
+
+
+# -----------------------------
+# TRAINING
+# -----------------------------
+
+def main() -> None:
+    global zeropower_via_newtonschulz5
+
+    code = Path(__file__).read_text(encoding="utf-8")
+    args = Hyperparameters()
+    zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5)
+
+    # -----------------------------
+    # DISTRIBUTED + CUDA SETUP
+    # -----------------------------
+
+    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+    rank = int(os.environ.get("RANK", "0"))
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    if world_size <= 0:
+        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
+    if 8 % world_size != 0:
+        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
+    grad_accum_steps = 8 // world_size
+    grad_scale = 1.0 / grad_accum_steps
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required")
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+    master_process = rank == 0
+
+    # Fast math knobs
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
+
+    enable_cudnn_sdp(False)
+    enable_flash_sdp(True)
+    enable_mem_efficient_sdp(False)
+    enable_math_sdp(False)
+
+    logfile = None
+    if master_process:
+        os.makedirs("logs", exist_ok=True)
+        logfile = f"logs/{args.run_id}.txt"
+        print(logfile)
+
+    def log0(msg: str, console: bool = True) -> None:
+        if not master_process:
+            return
+        if console:
+            print(msg)
+        if logfile is not None:
+            with open(logfile, "a", encoding="utf-8") as f:
+                print(msg, file=f)
+
+    log0(code, console=False)
+    log0("=" * 100, console=False)
+    log0(f"Running Python {sys.version}", console=False)
+    log0(f"Running PyTorch {torch.__version__}", console=False)
+    log0(
+        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout,
+        console=False,
+    )
+    log0("=" * 100, console=False)
+
+    # -----------------------------
+    # TOKENIZER + VALIDATION METRIC SETUP
+    # -----------------------------
+
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    torch.cuda.manual_seed_all(args.seed)
+
+    if not args.tokenizer_path.endswith(".model"):
+        raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}")
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    if int(sp.vocab_size()) != args.vocab_size:
+        raise ValueError(
+            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
+        )
+    dataset_dir = Path(args.data_path).resolve()
+    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
+    val_tokens = load_validation_tokens(args.val_files, args.train_seq_len)
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
+        sp, args.vocab_size, device
+    )
+    log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
+    log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}")
+    log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}")
+
+    # -----------------------------
+    # MODEL + OPTIMIZER SETUP
+    # -----------------------------
+
+    base_model = GPT(
+        vocab_size=args.vocab_size,
+        num_layers=args.num_layers,
+        model_dim=args.model_dim,
+        num_heads=args.num_heads,
+        num_kv_heads=args.num_kv_heads,
+        mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings,
+        tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap,
+        rope_base=args.rope_base,
+        qk_gain_init=args.qk_gain_init,
+    ).to(device).bfloat16()
+    for module in base_model.modules():
+        if isinstance(module, CastedLinear):
+            module.float()
+    restore_low_dim_params_to_fp32(base_model)
+    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
+    model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model
+
+    # Optimizer split:
+    # - token embedding (Adam) uses EMBED_LR
+    # - untied lm_head (Adam) uses HEAD_LR
+    # - matrix params in transformer blocks use MATRIX_LR via Muon
+    # - vectors/scalars use SCALAR_LR via Adam
+    block_named_params = list(base_model.blocks.named_parameters())
+    matrix_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    scalar_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    if base_model.skip_weights.numel() > 0:
+        scalar_params.append(base_model.skip_weights)
+    token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
+    optimizer_tok = torch.optim.Adam(
+        [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizer_muon = Muon(
+        matrix_params,
+        lr=args.matrix_lr,
+        momentum=args.muon_momentum,
+        backend_steps=args.muon_backend_steps,
+    )
+    for group in optimizer_muon.param_groups:
+        group["base_lr"] = args.matrix_lr
+    optimizer_scalar = torch.optim.Adam(
+        [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar]
+    if base_model.lm_head is not None:
+        optimizer_head = torch.optim.Adam(
+            [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}],
+            betas=(args.beta1, args.beta2),
+            eps=args.adam_eps,
+            fused=True,
+        )
+        optimizers.insert(1, optimizer_head)
+
+    n_params = sum(p.numel() for p in base_model.parameters())
+    log0(f"model_params:{n_params}")
+    log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
+    log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
+    log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
+    log0(
+        f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} "
+        f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} "
+        f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}"
+    )
+    log0(
+        f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} "
+        f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} "
+        f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}"
+    )
+    log0(f"seed:{args.seed}")
+
+    # -----------------------------
+    # DATA LOADER & MODEL WARMUP
+    # -----------------------------
+
+    train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    def zero_grad_all() -> None:
+        for opt in optimizers:
+            opt.zero_grad(set_to_none=True)
+
+    max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
+
+    def lr_mul(step: int, elapsed_ms: float) -> float:
+        if args.warmdown_iters <= 0:
+            return 1.0
+        if max_wallclock_ms is None:
+            warmdown_start = max(args.iterations - args.warmdown_iters, 0)
+            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0
+        step_ms = elapsed_ms / max(step, 1)
+        warmdown_ms = args.warmdown_iters * step_ms
+        remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
+        return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
+
+    # Warmup primes the compiled forward/backward/optimizer paths, then we restore the
+    # initial weights/optimizer state so measured training starts from the true init.
+    if args.warmup_steps > 0:
+        initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()}
+        initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers]
+        model.train()
+        for warmup_step in range(args.warmup_steps):
+            zero_grad_all()
+            for micro_step in range(grad_accum_steps):
+                if distributed:
+                    model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+                x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+                with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                    warmup_loss = model(x, y)
+                (warmup_loss * grad_scale).backward()
+            for opt in optimizers:
+                opt.step()
+            zero_grad_all()
+            if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps:
+                log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}")
+        base_model.load_state_dict(initial_model_state, strict=True)
+        for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
+            opt.load_state_dict(state)
+        zero_grad_all()
+        if distributed:
+            model.require_backward_grad_sync = True
+        train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    # -----------------------------
+    # MAIN TRAINING LOOP
+    # -----------------------------
+
+    training_time_ms = 0.0
+    stop_after_step: int | None = None
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+
+    step = 0
+    while True:
+        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+
+        should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0)
+        if should_validate:
+            torch.cuda.synchronize()
+            training_time_ms += 1000.0 * (time.perf_counter() - t0)
+            val_loss, val_bpb = eval_val(
+                args,
+                model,
+                rank,
+                world_size,
+                device,
+                grad_accum_steps,
+                val_tokens,
+                base_bytes_lut,
+                has_leading_space_lut,
+                is_boundary_token_lut,
+            )
+            log0(
+                f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+                f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms"
+            )
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+
+        if last_step:
+            if stop_after_step is not None and step < args.iterations:
+                log0(
+                    f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms "
+                    f"step:{step}/{args.iterations}"
+                )
+            break
+
+        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        scale = lr_mul(step, elapsed_ms)
+        zero_grad_all()
+        train_loss = torch.zeros((), device=device)
+        for micro_step in range(grad_accum_steps):
+            if distributed:
+                model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+            x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                loss = model(x, y)
+            train_loss += loss.detach()
+            (loss * grad_scale).backward()
+        train_loss /= grad_accum_steps
+
+        frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+        muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+        for group in optimizer_muon.param_groups:
+            group["momentum"] = muon_momentum
+
+        for opt in optimizers:
+            for group in opt.param_groups:
+                group["lr"] = group["base_lr"] * scale
+
+        if args.grad_clip_norm > 0:
+            torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm)
+        for opt in optimizers:
+            opt.step()
+        zero_grad_all()
+
+        step += 1
+        approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        should_log_train = (
+            args.train_log_every > 0
+            and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None)
+        )
+        if should_log_train:
+            log0(
+                f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} "
+                f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms"
+            )
+
+        # Needed to sync whether we've reached the wallclock cap.
+        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
+        if distributed and max_wallclock_ms is not None:
+            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
+            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
+            reached_cap = bool(reached_cap_tensor.item())
+        if stop_after_step is None and reached_cap:
+            stop_after_step = step
+
+    log0(
+        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
+        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
+
+    # -----------------------------
+    # SERIALIZATION + ROUNDTRIP VALIDATION
+    # -----------------------------
+    # Save the raw state (useful for debugging/loading in PyTorch directly), then always produce
+    # the compressed int8+zlib artifact and validate the round-tripped weights.
+
+    if master_process:
+        torch.save(base_model.state_dict(), "final_model.pt")
+        model_bytes = os.path.getsize("final_model.pt")
+        code_bytes = len(code.encode("utf-8"))
+        log0(f"Serialized model: {model_bytes} bytes")
+        log0(f"Code size: {code_bytes} bytes")
+        log0(f"Total submission size: {model_bytes + code_bytes} bytes")
+
+    quant_obj, quant_stats = quantize_state_dict_int8(base_model.state_dict())
+    quant_buf = io.BytesIO()
+    torch.save(quant_obj, quant_buf)
+    quant_raw = quant_buf.getvalue()
+    quant_blob = zlib.compress(quant_raw, level=9)
+    quant_raw_bytes = len(quant_raw)
+    if master_process:
+        with open("final_model.int8.ptz", "wb") as f:
+            f.write(quant_blob)
+        quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+        code_bytes = len(code.encode("utf-8"))
+        ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
+        log0(
+            f"Serialized model int8+zlib: {quant_file_bytes} bytes "
+            f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
+        )
+        log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+
+    if distributed:
+        dist.barrier()
+    with open("final_model.int8.ptz", "rb") as f:
+        quant_blob_disk = f.read()
+    quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu")
+    base_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True)
+    torch.cuda.synchronize()
+    t_qeval = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args,
+        model,
+        rank,
+        world_size,
+        device,
+        grad_accum_steps,
+        val_tokens,
+        base_bytes_lut,
+        has_leading_space_lut,
+        is_boundary_token_lut,
+    )
+    torch.cuda.synchronize()
+    log0(
+        f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+        f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+    )
+    log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+
+    if distributed:
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..0c5eedc
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,10 @@
+numpy
+tqdm
+torch==2.10
+huggingface-hub
+kernels
+setuptools
+typing-extensions==4.15.0
+datasets
+tiktoken
+sentencepiece
+\ No newline at end of file
diff --git a/scripts/replace_hf_dataset_with_export.py b/scripts/replace_hf_dataset_with_export.py
new file mode 100755
index 0000000..4934755
--- /dev/null
+++ b/scripts/replace_hf_dataset_with_export.py
@@ -0,0 +1,116 @@
+#!/usr/bin/env python3
+"""Replace challenge dataset artifacts in a Hugging Face dataset repo with a local export."""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+from huggingface_hub import HfApi
+
+
+DEFAULT_REPO_ID = "willdepueoai/parameter-golf"
+DEFAULT_PATH_IN_REPO = "datasets"
+DATA_ARTIFACT_NAMES = {
+    "datasets",
+    "tokenizers",
+    "manifest.json",
+    "docs_selected.jsonl",
+    "docs_selected.source_manifest.json",
+    "tokenizer_config.export.json",
+    "snapshot_meta.json",
+}
+
+
+def repo_path(prefix: str, name: str) -> str:
+    return f"{prefix}/{name}" if prefix else name
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Replace old dataset artifacts in a HF dataset repo with a local export")
+    parser.add_argument("--repo-id", default=DEFAULT_REPO_ID)
+    parser.add_argument("--local-export-root", required=True)
+    parser.add_argument("--path-in-repo", default=DEFAULT_PATH_IN_REPO, help="Subdirectory inside the dataset repo")
+    parser.add_argument("--repo-type", default="dataset")
+    parser.add_argument("--revision", default=None)
+    parser.add_argument("--commit-message", default="Replace dataset export")
+    parser.add_argument("--dry-run", action="store_true")
+    return parser
+
+
+def main() -> None:
+    args = build_parser().parse_args()
+    api = HfApi()
+    local_export_root = Path(args.local_export_root).expanduser().resolve()
+    if not local_export_root.is_dir():
+        raise FileNotFoundError(local_export_root)
+
+    prefix = args.path_in_repo.strip("/")
+    top_level_local = {path.name for path in local_export_root.iterdir()}
+    delete_names = sorted(DATA_ARTIFACT_NAMES | top_level_local)
+    root_entries = {
+        entry.path: entry
+        for entry in api.list_repo_tree(
+            repo_id=args.repo_id,
+            recursive=False,
+            repo_type=args.repo_type,
+            revision=args.revision,
+        )
+    }
+
+    if prefix:
+        if prefix in root_entries:
+            print(f"delete {prefix}")
+            if not args.dry_run:
+                api.delete_folder(
+                    prefix,
+                    repo_id=args.repo_id,
+                    repo_type=args.repo_type,
+                    revision=args.revision,
+                    commit_message=f"Delete {prefix}",
+                )
+
+    remote_entries = root_entries if not prefix else {}
+
+    for name in delete_names:
+        if prefix:
+            break
+        remote_path = repo_path(prefix, name)
+        entry = remote_entries.get(remote_path)
+        if entry is None:
+            continue
+        print(f"delete {remote_path}")
+        if args.dry_run:
+            continue
+        if entry.__class__.__name__ == "RepoFolder":
+            api.delete_folder(
+                remote_path,
+                repo_id=args.repo_id,
+                repo_type=args.repo_type,
+                revision=args.revision,
+                commit_message=f"Delete {remote_path}",
+            )
+        else:
+            api.delete_file(
+                remote_path,
+                repo_id=args.repo_id,
+                repo_type=args.repo_type,
+                revision=args.revision,
+                commit_message=f"Delete {remote_path}",
+            )
+
+    print(f"upload {local_export_root} -> {prefix or '/'}")
+    if args.dry_run:
+        return
+    api.upload_folder(
+        repo_id=args.repo_id,
+        repo_type=args.repo_type,
+        revision=args.revision,
+        folder_path=local_export_root,
+        path_in_repo=prefix or None,
+        commit_message=args.commit_message,
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/upload_when_ready.sh b/scripts/upload_when_ready.sh
new file mode 100755
index 0000000..570ad0a
--- /dev/null
+++ b/scripts/upload_when_ready.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+if [[ $# -lt 2 || $# -gt 3 ]]; then
+  echo "usage: $0 <src_root> <dest_root> [poll_seconds]" >&2
+  exit 1
+fi
+
+src_root=$1
+dest_root=$2
+poll_seconds=${3:-120}
+manifest_path="${src_root%/}/manifest.json"
+
+if [[ "$poll_seconds" -le 0 ]]; then
+  echo "poll_seconds must be positive" >&2
+  exit 1
+fi
+
+while [[ ! -f "$manifest_path" ]]; do
+  sleep "$poll_seconds"
+done
+
+bbb cptree "$src_root" "$dest_root"
diff --git a/train_gpt.py b/train_gpt.py
new file mode 100644
index 0000000..0deb056
--- /dev/null
+++ b/train_gpt.py
@@ -0,0 +1,1126 @@
+"""
+The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
+
+Hard stop: `train_gpt.py` and `train_gpt_mlx.py` must never be longer than 1500 lines.
+"""
+
+from __future__ import annotations
+
+import copy
+import glob
+import io
+import math
+import os
+import random
+import subprocess
+import sys
+import time
+import uuid
+import zlib
+from pathlib import Path
+
+import numpy as np
+import sentencepiece as spm
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# -----------------------------
+# HYPERPARAMETERS
+# -----------------------------
+# Default Simple Baseline run:
+# - 9 transformer blocks at width 512
+# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
+# - vocab size 1024, sequence length 1024, tied embeddings
+# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+
+class Hyperparameters:
+    # Data paths are shard globs produced by the existing preprocessing pipeline.
+    data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+    train_files = os.path.join(data_path, "fineweb_train_*.bin")
+    val_files = os.path.join(data_path, "fineweb_val_*.bin")
+    tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+    run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
+    seed = int(os.environ.get("SEED", 1337))
+
+    # Validation cadence and batch size. Validation always uses the full fineweb_val split.
+    val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288))
+    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 1000))
+    train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 200))
+
+    # Training length.
+    iterations = int(os.environ.get("ITERATIONS", 20000))
+    warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 1200))
+    warmup_steps = int(os.environ.get("WARMUP_STEPS", 20))
+    train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 524_288))
+    train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 1024))
+    max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0))
+    qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))
+
+    # Model shape.
+    vocab_size = int(os.environ.get("VOCAB_SIZE", 1024))
+    num_layers = int(os.environ.get("NUM_LAYERS", 9))
+    num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4))
+    model_dim = int(os.environ.get("MODEL_DIM", 512))
+    num_heads = int(os.environ.get("NUM_HEADS", 8))
+    mlp_mult = int(os.environ.get("MLP_MULT", 2))
+    tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1")))
+    rope_base = float(os.environ.get("ROPE_BASE", 10000.0))
+    logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0))
+
+    # Optimizer hyperparameters.
+    embed_lr = float(os.environ.get("EMBED_LR", 0.6))
+    head_lr = float(os.environ.get("HEAD_LR", 0.008))
+    tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.05))
+    tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005))
+    matrix_lr = float(os.environ.get("MATRIX_LR", 0.04))
+    scalar_lr = float(os.environ.get("SCALAR_LR", 0.04))
+    muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95))
+    muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5))
+    muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
+    muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))
+    beta1 = float(os.environ.get("BETA1", 0.9))
+    beta2 = float(os.environ.get("BETA2", 0.95))
+    adam_eps = float(os.environ.get("ADAM_EPS", 1e-8))
+    grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.0))
+
+# -----------------------------
+# MUON OPTIMIZER 
+# -----------------------------
+# 
+# As borrowed from modded-nanogpt
+# Background on Muon: https://kellerjordan.github.io/posts/muon/
+
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+    # Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration.
+    # Muon uses this to normalize matrix-shaped gradients before applying them.
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    X = G.bfloat16()
+    X /= X.norm() + eps
+    transposed = G.size(0) > G.size(1)
+    if transposed:
+        X = X.T
+    for _ in range(steps):
+        A = X @ X.T
+        B = b * A + c * A @ A
+        X = a * X + B @ X
+    return X.T if transposed else X
+
+
+class Muon(torch.optim.Optimizer):
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True):
+        super().__init__(
+            params,
+            dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov),
+        )
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        distributed = dist.is_available() and dist.is_initialized()
+        world_size = dist.get_world_size() if distributed else 1
+        rank = dist.get_rank() if distributed else 0
+
+        for group in self.param_groups:
+            params = group["params"]
+            if not params:
+                continue
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+
+            total_params = sum(int(p.numel()) for p in params)
+            updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+
+            curr = 0
+            for i, p in enumerate(params):
+                if i % world_size == rank and p.grad is not None:
+                    g = p.grad
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                    buf.mul_(momentum).add_(g)
+                    if nesterov:
+                        g = g.add(buf, alpha=momentum)
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+                    # Scale correction from Muon reference implementations.
+                    g *= max(1, g.size(0) / g.size(1)) ** 0.5
+                    updates_flat[curr : curr + p.numel()] = g.reshape(-1)
+                curr += p.numel()
+
+            if distributed:
+                dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+
+            curr = 0
+            for p in params:
+                g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+                p.add_(g, alpha=-lr)
+                curr += p.numel()
+
+        return loss
+
+
+# -----------------------------
+# TOKENIZER-AGNOSTIC EVALUATION SETUP 
+# -----------------------------
+#
+# It's common for small models have a large fraction of their parameters be embeddings, since the 2 * d_model * d_vocab vectors can be gigantic.
+# Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set.
+# We calculate BPB (bits-per-byte) instead of validation loss, so we need methods to count the number of bits per token in the tokenizer.
+# Note: Submissions that edit the tokenizer will be examined more carefully, since screwing this up might unjustly improve your score.
+
+def build_sentencepiece_luts(
+    sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device
+) -> tuple[Tensor, Tensor, Tensor]:
+    sp_vocab_size = int(sp.vocab_size())
+    table_size = max(sp_vocab_size, vocab_size)
+    base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+    has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+    is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+    for token_id in range(sp_vocab_size):
+        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+            continue
+        is_boundary_token_np[token_id] = False
+        if sp.is_byte(token_id):
+            base_bytes_np[token_id] = 1
+            continue
+        piece = sp.id_to_piece(token_id)
+        if piece.startswith("▁"):
+            has_leading_space_np[token_id] = True
+            piece = piece[1:]
+        base_bytes_np[token_id] = len(piece.encode("utf-8"))
+    return (
+        torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+        torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+        torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+    )
+
+
+def load_validation_tokens(pattern: str, seq_len: int) -> Tensor:
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*.
+    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
+    usable = ((tokens.numel() - 1) // seq_len) * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def eval_val(
+    args: Hyperparameters,
+    model: nn.Module,
+    rank: int,
+    world_size: int,
+    device: torch.device,
+    grad_accum_steps: int,
+    val_tokens: Tensor,
+    base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor,
+    is_boundary_token_lut: Tensor,
+) -> tuple[float, float]:
+    # Validation computes two metrics:
+    # - val_loss: token cross-entropy (natural log)
+    # - val_bpb: tokenizer-agnostic compression metric used by the challenge
+    local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+    if local_batch_tokens < args.train_seq_len:
+        raise ValueError(
+            "VAL_BATCH_SIZE must provide at least one sequence per rank; "
+            f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, "
+            f"GRAD_ACCUM_STEPS={grad_accum_steps}, TRAIN_SEQ_LEN={args.train_seq_len}"
+        )
+    local_batch_seqs = local_batch_tokens // args.train_seq_len
+    total_seqs = (val_tokens.numel() - 1) // args.train_seq_len
+    seq_start = (total_seqs * rank) // world_size
+    seq_end = (total_seqs * (rank + 1)) // world_size
+    val_loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    val_token_count = torch.zeros((), device=device, dtype=torch.float64)
+    val_byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    model.eval()
+    with torch.inference_mode():
+        for batch_seq_start in range(seq_start, seq_end, local_batch_seqs):
+            batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end)
+            raw_start = batch_seq_start * args.train_seq_len
+            raw_end = batch_seq_end * args.train_seq_len + 1
+            local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True)
+            x = local[:-1].reshape(-1, args.train_seq_len)
+            y = local[1:].reshape(-1, args.train_seq_len)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                batch_loss = model(x, y).detach()
+            batch_token_count = float(y.numel())
+            val_loss_sum += batch_loss.to(torch.float64) * batch_token_count
+            val_token_count += batch_token_count
+            prev_ids = x.reshape(-1)
+            tgt_ids = y.reshape(-1)
+            token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)
+            token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)
+            val_byte_count += token_bytes.to(torch.float64).sum()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM)
+
+    val_loss = val_loss_sum / val_token_count
+    bits_per_token = val_loss.item() / math.log(2.0)
+    tokens_per_byte = val_token_count.item() / val_byte_count.item()
+    model.train()
+    return float(val_loss.item()), float(bits_per_token * tokens_per_byte)
+
+# -----------------------------
+# POST-TRAINING QUANTIZATION
+# -----------------------------
+#
+# It's silly to export our model, which is trained in bf16 and fp32, at that same precision.
+# Instead, we get approximately the same model (with a small hit) by quantizing the model to int8 & zlib compressing.
+# We can then decompress the model and run in higher precision for evaluation, after closing in under the size limit.
+
+CONTROL_TENSOR_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "CONTROL_TENSOR_NAME_PATTERNS",
+        "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights",
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
+        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
+INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16
+INT8_PER_ROW_SCALE_DTYPE = torch.float16
+INT8_CLIP_PERCENTILE = 99.99984
+INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
+
+def tensor_nbytes(t: Tensor) -> int:
+    return int(t.numel()) * int(t.element_size())
+
+def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor:
+    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
+        return t.float().contiguous()
+    if t.dtype in {torch.float32, torch.bfloat16}:
+        passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.")
+        return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous()
+    return t
+
+def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]:
+    t32 = t.float()
+    if t32.ndim == 2:
+        # Matrices get one scale per row, which usually tracks output-channel
+        # ranges much better than a single tensor-wide scale.
+        clip_abs = (
+            torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1)
+            if t32.numel()
+            else torch.empty((t32.shape[0],), dtype=torch.float32)
+        )
+        clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None])
+        scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0)
+        q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous()
+        return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous()
+
+    # Vectors / scalars use a simpler per-tensor scale.
+    clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0
+    scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32)
+    q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous()
+    return q, scale
+
+def quantize_state_dict_int8(state_dict: dict[str, Tensor]):
+    # Single supported clean-script export format:
+    # - per-row int8 for 2D float tensors
+    # - per-tensor int8 for other float tensors
+    # - exact passthrough for non-floats
+    # - passthrough for small float tensors, stored as fp16 to save bytes
+    quantized: dict[str, Tensor] = {}
+    scales: dict[str, Tensor] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, Tensor] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+
+    for name, tensor in state_dict.items():
+        t = tensor.detach().to("cpu").contiguous()
+        stats["param_count"] += int(t.numel())
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += tensor_nbytes(t)
+
+        if not t.is_floating_point():
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = t
+            stats["int8_payload_bytes"] += tensor_nbytes(t)
+            continue
+
+        # Small float tensors are cheap enough to keep directly. We still downcast
+        # fp32/bf16 passthrough tensors to fp16 so metadata does not dominate size.
+        if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_tensor(name, t, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += tensor_nbytes(kept)
+            continue
+
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_tensor(t)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(t.dtype).removeprefix("torch.")
+        stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s)
+
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+
+def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    qmeta = obj.get("qmeta", {})
+    passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {})
+    for name, q in obj["quantized"].items():
+        dtype = getattr(torch, obj["dtypes"][name])
+        s = obj["scales"][name]
+        if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0:
+            s = s.to(dtype=torch.float32)
+            # Broadcast the saved row scale back across trailing dimensions.
+            out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous()
+        else:
+            scale = float(s.item())
+            out[name] = (q.float() * scale).to(dtype=dtype).contiguous()
+    for name, t in obj["passthrough"].items():
+        # Restore small tensors, undoing the temporary fp16 storage cast if needed.
+        out_t = t.detach().to("cpu").contiguous()
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous()
+        out[name] = out_t
+    return out
+
+
+# -----------------------------
+# DATA LOADING 
+# -----------------------------
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    token_bytes = np.dtype("<u2").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    # SHARD HEADER INTS & SHARD_MAGIC
+    if header.size != 256 or int(header[0]) != 20240520 or int(header[1]) != 1:
+        raise ValueError(f"Unexpected shard header for {file}")
+    num_tokens = int(header[2])
+    expected_size = header_bytes + num_tokens * token_bytes
+    if file.stat().st_size != expected_size:
+        raise ValueError(f"Shard size mismatch for {file}: expected {expected_size} bytes")
+    tokens_np = np.fromfile(file, dtype="<u2", count=num_tokens, offset=header_bytes)
+    if tokens_np.size != num_tokens:
+        raise ValueError(f"Short read for {file}")
+    return torch.from_numpy(tokens_np.astype(np.uint16, copy=False))
+
+
+class TokenStream:
+    # Reads shards sequentially and wraps around forever. The training loop therefore
+    # has deterministic, simple streaming behavior with no sampling or workers.
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[0])
+        self.pos = 0
+
+    def _advance_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> Tensor:
+        chunks: list[Tensor] = []
+        remaining = n
+        while remaining > 0:
+            avail = self.tokens.numel() - self.pos
+            if avail <= 0:
+                self._advance_file()
+                continue
+            k = min(remaining, avail)
+            chunks.append(self.tokens[self.pos : self.pos + k])
+            self.pos += k
+            remaining -= k
+        return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+
+class DistributedTokenLoader:
+    # Each call consumes a contiguous chunk from the shared token stream, then slices out
+    # one disjoint span per rank. The extra "+1" token lets us build (x, y) by shifting.
+    def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+        self.rank = rank
+        self.world_size = world_size
+        self.device = device
+        self.stream = TokenStream(pattern)
+
+    def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
+        local_tokens = global_tokens // (self.world_size * grad_accum_steps)
+        per_rank_span = local_tokens + 1
+        chunk = self.stream.take(per_rank_span * self.world_size)
+        start = self.rank * per_rank_span
+        local = chunk[start : start + per_rank_span].to(dtype=torch.int64)
+        x = local[:-1].reshape(-1, seq_len)
+        y = local[1:].reshape(-1, seq_len)
+        return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
+
+# -----------------------------
+# TRANSFORMER MODULES
+# -----------------------------
+
+class RMSNorm(nn.Module):
+    def __init__(self, eps: float | None = None):
+        super().__init__()
+        self.eps = eps
+
+    def forward(self, x: Tensor) -> Tensor:
+        return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+
+class CastedLinear(nn.Linear):
+    # Keep weights in fp32 for optimizer/state quality, cast at matmul time for bf16 compute.
+    def forward(self, x: Tensor) -> Tensor:
+        bias = self.bias.to(x.dtype) if self.bias is not None else None
+        return F.linear(x, self.weight.to(x.dtype), bias)
+
+
+def restore_low_dim_params_to_fp32(module: nn.Module) -> None:
+    # Keep small/control parameters in fp32 even when the model body runs in bf16.
+    with torch.no_grad():
+        for name, param in module.named_parameters():
+            if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32:
+                param.data = param.data.float()
+
+
+class Rotary(nn.Module):
+    # Caches cos/sin tables per sequence length on the current device.
+    def __init__(self, dim: int, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached: Tensor | None = None
+        self._sin_cached: Tensor | None = None
+
+    def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]:
+        if (
+            self._cos_cached is None
+            or self._sin_cached is None
+            or self._seq_len_cached != seq_len
+            or self._cos_cached.device != device
+        ):
+            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+            freqs = torch.outer(t, self.inv_freq.to(device))
+            self._cos_cached = freqs.cos()[None, None, :, :]
+            self._sin_cached = freqs.sin()[None, None, :, :]
+            self._seq_len_cached = seq_len
+        return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
+
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+
+class CausalSelfAttention(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if dim % num_heads != 0:
+            raise ValueError("model_dim must be divisible by num_heads")
+        if num_heads % num_kv_heads != 0:
+            raise ValueError("num_heads must be divisible by num_kv_heads")
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = dim // num_heads
+        if self.head_dim % 2 != 0:
+            raise ValueError("head_dim must be even for RoPE")
+        kv_dim = self.num_kv_heads * self.head_dim
+        self.c_q = CastedLinear(dim, dim, bias=False)
+        self.c_k = CastedLinear(dim, kv_dim, bias=False)
+        self.c_v = CastedLinear(dim, kv_dim, bias=False)
+        self.proj = CastedLinear(dim, dim, bias=False)
+        self.proj._zero_init = True
+        self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+        self.rotary = Rotary(self.head_dim, base=rope_base)
+
+    def forward(self, x: Tensor) -> Tensor:
+        bsz, seqlen, dim = x.shape
+        q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
+        k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        q = F.rms_norm(q, (q.size(-1),))
+        k = F.rms_norm(k, (k.size(-1),))
+        cos, sin = self.rotary(seqlen, x.device, q.dtype)
+        q = apply_rotary_emb(q, cos, sin)
+        k = apply_rotary_emb(k, cos, sin)
+        q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None]
+        y = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=None,
+            is_causal=True,
+            enable_gqa=(self.num_kv_heads != self.num_heads),
+        )
+        y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim)
+        return self.proj(y)
+
+
+class MLP(nn.Module):
+    # relu^2 MLP from the original modded-nanogpt setup
+    def __init__(self, dim: int, mlp_mult: int):
+        super().__init__()
+        hidden = mlp_mult * dim
+        self.fc = CastedLinear(dim, hidden, bias=False)
+        self.proj = CastedLinear(hidden, dim, bias=False)
+        self.proj._zero_init = True
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = torch.relu(self.fc(x))
+        return self.proj(x.square())
+
+
+class Block(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init)
+        self.mlp = MLP(dim, mlp_mult)
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x))
+        x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out
+        x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x))
+        return x
+
+
+class GPT(nn.Module):
+    def __init__(
+        self,
+        vocab_size: int,
+        num_layers: int,
+        model_dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        tie_embeddings: bool,
+        tied_embed_init_std: float,
+        logit_softcap: float,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if logit_softcap <= 0.0:
+            raise ValueError(f"logit_softcap must be positive, got {logit_softcap}")
+        self.tie_embeddings = tie_embeddings
+        self.tied_embed_init_std = tied_embed_init_std
+        self.logit_softcap = logit_softcap
+        self.tok_emb = nn.Embedding(vocab_size, model_dim)
+        self.num_encoder_layers = num_layers // 2
+        self.num_decoder_layers = num_layers - self.num_encoder_layers
+        self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+        self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
+        self.blocks = nn.ModuleList(
+            [
+                Block(
+                    model_dim,
+                    num_heads,
+                    num_kv_heads,
+                    mlp_mult,
+                    rope_base,
+                    qk_gain_init,
+                )
+                for i in range(num_layers)
+            ]
+        )
+        self.final_norm = RMSNorm()
+        self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False)
+        if self.lm_head is not None:
+            self.lm_head._zero_init = True
+        self._init_weights()
+
+    def _init_weights(self) -> None:
+        if self.tie_embeddings:
+            nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std)
+        for module in self.modules():
+            if isinstance(module, nn.Linear) and getattr(module, "_zero_init", False):
+                nn.init.zeros_(module.weight)
+
+    def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor:
+        x = self.tok_emb(input_ids)
+        x = F.rms_norm(x, (x.size(-1),))
+        x0 = x
+        skips: list[Tensor] = []
+
+        # First half stores skips; second half reuses them in reverse order.
+        for i in range(self.num_encoder_layers):
+            x = self.blocks[i](x, x0)
+            skips.append(x)
+        for i in range(self.num_decoder_layers):
+            if skips:
+                x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop()
+            x = self.blocks[self.num_encoder_layers + i](x, x0)
+
+        x = self.final_norm(x).reshape(-1, x.size(-1))
+        targets = target_ids.reshape(-1)
+        if self.tie_embeddings:
+            logits_proj = F.linear(x, self.tok_emb.weight)
+        else:
+            if self.lm_head is None:
+                raise RuntimeError("lm_head is required when tie_embeddings=False")
+            logits_proj = self.lm_head(x)
+        logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)
+        return F.cross_entropy(logits.float(), targets, reduction="mean")
+
+
+# -----------------------------
+# TRAINING
+# -----------------------------
+
+def main() -> None:
+    global zeropower_via_newtonschulz5
+
+    code = Path(__file__).read_text(encoding="utf-8")
+    args = Hyperparameters()
+    zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5)
+
+    # -----------------------------
+    # DISTRIBUTED + CUDA SETUP
+    # -----------------------------
+
+    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+    rank = int(os.environ.get("RANK", "0"))
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    if world_size <= 0:
+        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
+    if 8 % world_size != 0:
+        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
+    grad_accum_steps = 8 // world_size
+    grad_scale = 1.0 / grad_accum_steps
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required")
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+    master_process = rank == 0
+
+    # Fast math knobs
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
+
+    enable_cudnn_sdp(False)
+    enable_flash_sdp(True)
+    enable_mem_efficient_sdp(False)
+    enable_math_sdp(False)
+
+    logfile = None
+    if master_process:
+        os.makedirs("logs", exist_ok=True)
+        logfile = f"logs/{args.run_id}.txt"
+        print(logfile)
+
+    def log0(msg: str, console: bool = True) -> None:
+        if not master_process:
+            return
+        if console:
+            print(msg)
+        if logfile is not None:
+            with open(logfile, "a", encoding="utf-8") as f:
+                print(msg, file=f)
+
+    log0(code, console=False)
+    log0("=" * 100, console=False)
+    log0(f"Running Python {sys.version}", console=False)
+    log0(f"Running PyTorch {torch.__version__}", console=False)
+    log0(
+        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout,
+        console=False,
+    )
+    log0("=" * 100, console=False)
+
+    # -----------------------------
+    # TOKENIZER + VALIDATION METRIC SETUP
+    # -----------------------------
+
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    torch.cuda.manual_seed_all(args.seed)
+
+    if not args.tokenizer_path.endswith(".model"):
+        raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}")
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    if int(sp.vocab_size()) != args.vocab_size:
+        raise ValueError(
+            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
+        )
+    dataset_dir = Path(args.data_path).resolve()
+    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
+    val_tokens = load_validation_tokens(args.val_files, args.train_seq_len)
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
+        sp, args.vocab_size, device
+    )
+    log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
+    log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}")
+    log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}")
+
+    # -----------------------------
+    # MODEL + OPTIMIZER SETUP
+    # -----------------------------
+
+    base_model = GPT(
+        vocab_size=args.vocab_size,
+        num_layers=args.num_layers,
+        model_dim=args.model_dim,
+        num_heads=args.num_heads,
+        num_kv_heads=args.num_kv_heads,
+        mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings,
+        tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap,
+        rope_base=args.rope_base,
+        qk_gain_init=args.qk_gain_init,
+    ).to(device).bfloat16()
+    for module in base_model.modules():
+        if isinstance(module, CastedLinear):
+            module.float()
+    restore_low_dim_params_to_fp32(base_model)
+    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
+    model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model
+
+    # Optimizer split:
+    # - token embedding (Adam) uses EMBED_LR
+    # - untied lm_head (Adam) uses HEAD_LR
+    # - matrix params in transformer blocks use MATRIX_LR via Muon
+    # - vectors/scalars use SCALAR_LR via Adam
+    block_named_params = list(base_model.blocks.named_parameters())
+    matrix_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    scalar_params = [
+        p
+        for name, p in block_named_params
+        if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+    ]
+    if base_model.skip_weights.numel() > 0:
+        scalar_params.append(base_model.skip_weights)
+    token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
+    optimizer_tok = torch.optim.Adam(
+        [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizer_muon = Muon(
+        matrix_params,
+        lr=args.matrix_lr,
+        momentum=args.muon_momentum,
+        backend_steps=args.muon_backend_steps,
+    )
+    for group in optimizer_muon.param_groups:
+        group["base_lr"] = args.matrix_lr
+    optimizer_scalar = torch.optim.Adam(
+        [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
+        betas=(args.beta1, args.beta2),
+        eps=args.adam_eps,
+        fused=True,
+    )
+    optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar]
+    if base_model.lm_head is not None:
+        optimizer_head = torch.optim.Adam(
+            [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}],
+            betas=(args.beta1, args.beta2),
+            eps=args.adam_eps,
+            fused=True,
+        )
+        optimizers.insert(1, optimizer_head)
+
+    n_params = sum(p.numel() for p in base_model.parameters())
+    log0(f"model_params:{n_params}")
+    log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
+    log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
+    log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
+    log0(
+        f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} "
+        f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} "
+        f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}"
+    )
+    log0(
+        f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} "
+        f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} "
+        f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}"
+    )
+    log0(f"seed:{args.seed}")
+
+    # -----------------------------
+    # DATA LOADER & MODEL WARMUP
+    # -----------------------------
+
+    train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    def zero_grad_all() -> None:
+        for opt in optimizers:
+            opt.zero_grad(set_to_none=True)
+
+    max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
+
+    def lr_mul(step: int, elapsed_ms: float) -> float:
+        if args.warmdown_iters <= 0:
+            return 1.0
+        if max_wallclock_ms is None:
+            warmdown_start = max(args.iterations - args.warmdown_iters, 0)
+            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0
+        step_ms = elapsed_ms / max(step, 1)
+        warmdown_ms = args.warmdown_iters * step_ms
+        remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
+        return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
+
+    # Warmup primes the compiled forward/backward/optimizer paths, then we restore the
+    # initial weights/optimizer state so measured training starts from the true init.
+    if args.warmup_steps > 0:
+        initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()}
+        initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers]
+        model.train()
+        for warmup_step in range(args.warmup_steps):
+            zero_grad_all()
+            for micro_step in range(grad_accum_steps):
+                if distributed:
+                    model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+                x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+                with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                    warmup_loss = model(x, y)
+                (warmup_loss * grad_scale).backward()
+            for opt in optimizers:
+                opt.step()
+            zero_grad_all()
+            if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps:
+                log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}")
+        base_model.load_state_dict(initial_model_state, strict=True)
+        for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
+            opt.load_state_dict(state)
+        zero_grad_all()
+        if distributed:
+            model.require_backward_grad_sync = True
+        train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+    # -----------------------------
+    # MAIN TRAINING LOOP
+    # -----------------------------
+
+    training_time_ms = 0.0
+    stop_after_step: int | None = None
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+
+    step = 0
+    while True:
+        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+
+        should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0)
+        if should_validate:
+            torch.cuda.synchronize()
+            training_time_ms += 1000.0 * (time.perf_counter() - t0)
+            val_loss, val_bpb = eval_val(
+                args,
+                model,
+                rank,
+                world_size,
+                device,
+                grad_accum_steps,
+                val_tokens,
+                base_bytes_lut,
+                has_leading_space_lut,
+                is_boundary_token_lut,
+            )
+            log0(
+                f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+                f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms"
+            )
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+
+        if last_step:
+            if stop_after_step is not None and step < args.iterations:
+                log0(
+                    f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms "
+                    f"step:{step}/{args.iterations}"
+                )
+            break
+
+        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        scale = lr_mul(step, elapsed_ms)
+        zero_grad_all()
+        train_loss = torch.zeros((), device=device)
+        for micro_step in range(grad_accum_steps):
+            if distributed:
+                model.require_backward_grad_sync = micro_step == grad_accum_steps - 1
+            x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                loss = model(x, y)
+            train_loss += loss.detach()
+            (loss * grad_scale).backward()
+        train_loss /= grad_accum_steps
+
+        frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+        muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+        for group in optimizer_muon.param_groups:
+            group["momentum"] = muon_momentum
+
+        for opt in optimizers:
+            for group in opt.param_groups:
+                group["lr"] = group["base_lr"] * scale
+
+        if args.grad_clip_norm > 0:
+            torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm)
+        for opt in optimizers:
+            opt.step()
+        zero_grad_all()
+
+        step += 1
+        approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        should_log_train = (
+            args.train_log_every > 0
+            and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None)
+        )
+        if should_log_train:
+            log0(
+                f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} "
+                f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms"
+            )
+
+        # Needed to sync whether we've reached the wallclock cap.
+        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
+        if distributed and max_wallclock_ms is not None:
+            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
+            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
+            reached_cap = bool(reached_cap_tensor.item())
+        if stop_after_step is None and reached_cap:
+            stop_after_step = step
+
+    log0(
+        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
+        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
+
+    # -----------------------------
+    # SERIALIZATION + ROUNDTRIP VALIDATION
+    # -----------------------------
+    # Save the raw state (useful for debugging/loading in PyTorch directly), then always produce
+    # the compressed int8+zlib artifact and validate the round-tripped weights.
+
+    if master_process:
+        torch.save(base_model.state_dict(), "final_model.pt")
+        model_bytes = os.path.getsize("final_model.pt")
+        code_bytes = len(code.encode("utf-8"))
+        log0(f"Serialized model: {model_bytes} bytes")
+        log0(f"Code size: {code_bytes} bytes")
+        log0(f"Total submission size: {model_bytes + code_bytes} bytes")
+
+    quant_obj, quant_stats = quantize_state_dict_int8(base_model.state_dict())
+    quant_buf = io.BytesIO()
+    torch.save(quant_obj, quant_buf)
+    quant_raw = quant_buf.getvalue()
+    quant_blob = zlib.compress(quant_raw, level=9)
+    quant_raw_bytes = len(quant_raw)
+    if master_process:
+        with open("final_model.int8.ptz", "wb") as f:
+            f.write(quant_blob)
+        quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+        code_bytes = len(code.encode("utf-8"))
+        ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
+        log0(
+            f"Serialized model int8+zlib: {quant_file_bytes} bytes "
+            f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
+        )
+        log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+
+    if distributed:
+        dist.barrier()
+    with open("final_model.int8.ptz", "rb") as f:
+        quant_blob_disk = f.read()
+    quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu")
+    base_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True)
+    torch.cuda.synchronize()
+    t_qeval = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args,
+        model,
+        rank,
+        world_size,
+        device,
+        grad_accum_steps,
+        val_tokens,
+        base_bytes_lut,
+        has_leading_space_lut,
+        is_boundary_token_lut,
+    )
+    torch.cuda.synchronize()
+    log0(
+        f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+        f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+    )
+    log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+
+    if distributed:
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/train_gpt_mlx.py b/train_gpt_mlx.py
new file mode 100644
index 0000000..bf7c7d1
--- /dev/null
+++ b/train_gpt_mlx.py
@@ -0,0 +1,1088 @@
+#!/usr/bin/env python3
+"""
+The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
+
+Hard stop: `train_gpt.py` and `train_gpt_mlx.py` must never be longer than 1500 lines.
+"""
+from __future__ import annotations
+
+import glob
+import json
+import math
+import os
+import pickle
+import sys
+import time
+import uuid
+import zlib
+from collections.abc import Callable
+from pathlib import Path
+
+import numpy as np
+import sentencepiece as spm
+
+import mlx.core as mx
+import mlx.nn as nn
+import mlx.optimizers as optim
+from mlx.utils import tree_flatten, tree_unflatten
+
+# ==============================================================================
+# SHARD FORMAT + COMPUTE DTYPE
+# ==============================================================================
+
+COMPUTE_DTYPE = mx.bfloat16
+
+# ==============================================================================
+# HYPERPARAMETERS
+# ==============================================================================
+# Default Simple Baseline run:
+# - 9 transformer blocks at width 512
+# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
+# - vocab size 1024, sequence length 1024, tied embeddings
+# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+class Hyperparameters:
+    # Data / tokenizer.
+    data_path: str = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+    tokenizer_path: str = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+    run_id: str = os.environ.get("RUN_ID", str(uuid.uuid4()))
+    seed: int = int(os.environ.get("SEED", 1337))
+
+    # Training loop. These defaults now mirror train_gpt.py on a single process.
+    iterations: int = int(os.environ.get("ITERATIONS", 20_000))
+    val_loss_every: int = int(os.environ.get("VAL_LOSS_EVERY", 0))
+    # Validation always uses the full fineweb_val split.
+    val_batch_size: int = int(os.environ.get("VAL_BATCH_SIZE", 524_288))
+    train_log_every: int = int(os.environ.get("TRAIN_LOG_EVERY", 200))
+    train_batch_tokens: int = int(os.environ.get("TRAIN_BATCH_TOKENS", 524_288))
+    grad_accum_steps: int = int(os.environ.get("GRAD_ACCUM_STEPS", 8))
+    train_seq_len: int = int(os.environ.get("TRAIN_SEQ_LEN", os.environ.get("TRAIN_MAX_SEQ_LEN", 1024)))
+    # Chunk each logical MLX microbatch into smaller sub-batches to reduce peak
+    # memory pressure without changing the effective optimizer batch.
+    mlx_max_microbatch_tokens: int = int(os.environ.get("MLX_MAX_MICROBATCH_TOKENS", 8_192))
+    warmup_steps: int = int(os.environ.get("WARMUP_STEPS", 20))
+    warmdown_iters: int = int(os.environ.get("WARMDOWN_ITERS", 1200))
+    max_wallclock_seconds: float = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0))
+
+    # Model (defaults match the current baseline setup).
+    vocab_size: int = int(os.environ.get("VOCAB_SIZE", 1024))
+    num_layers: int = int(os.environ.get("NUM_LAYERS", 9))
+    model_dim: int = int(os.environ.get("MODEL_DIM", 512))
+    num_heads: int = int(os.environ.get("NUM_HEADS", 8))
+    num_kv_heads: int = int(os.environ.get("NUM_KV_HEADS", 4))
+    mlp_mult: int = int(os.environ.get("MLP_MULT", 2))
+    tie_embeddings: bool = bool(int(os.environ.get("TIE_EMBEDDINGS", "1")))
+    tied_embed_init_std: float = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005))
+    logit_chunk_tokens: int = int(os.environ.get("LOGIT_CHUNK_TOKENS", 0))
+    logit_softcap: float = float(os.environ.get("LOGIT_SOFTCAP", 30.0))
+    rope_base: float = float(os.environ.get("ROPE_BASE", 10000.0))
+    qk_gain_init: float = float(os.environ.get("QK_GAIN_INIT", 1.5))
+
+    # Optimizer. We keep the same per-group defaults as train_gpt.py.
+    beta1: float = float(os.environ.get("BETA1", 0.9))
+    beta2: float = float(os.environ.get("BETA2", 0.95))
+    adam_eps: float = float(os.environ.get("ADAM_EPS", 1e-8))
+    tied_embed_lr: float = float(os.environ.get("TIED_EMBED_LR", 0.05))
+    matrix_lr: float = float(os.environ.get("MATRIX_LR", 0.04))
+    scalar_lr: float = float(os.environ.get("SCALAR_LR", 0.04))
+    muon_momentum: float = float(os.environ.get("MUON_MOMENTUM", 0.95))
+    muon_backend_steps: int = int(os.environ.get("MUON_BACKEND_STEPS", 5))
+    muon_momentum_warmup_start: float = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
+    muon_momentum_warmup_steps: int = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))
+    grad_clip_norm: float = float(os.environ.get("GRAD_CLIP_NORM", 0.0))
+
+    out_dir: str = os.environ.get("OUT_DIR", "logs")
+
+    @property
+    def train_files(self) -> str:
+        return f"{self.data_path}/fineweb_train_*.bin"
+
+    @property
+    def val_files(self) -> str:
+        return f"{self.data_path}/fineweb_val_*.bin"
+
+    @property
+    def microbatch_tokens(self) -> int:
+        return self.train_batch_tokens // self.grad_accum_steps
+
+    def lr_mul(self, step: int, elapsed_ms: float) -> float:
+        if self.warmdown_iters <= 0:
+            return 1.0
+        if self.max_wallclock_seconds <= 0:
+            warmdown_start = max(self.iterations - self.warmdown_iters, 0)
+            return max((self.iterations - step) / max(self.warmdown_iters, 1), 0.0) if warmdown_start <= step < self.iterations else 1.0
+        step_ms = elapsed_ms / max(step, 1)
+        warmdown_ms = self.warmdown_iters * step_ms
+        remaining_ms = max(1000.0 * self.max_wallclock_seconds - elapsed_ms, 0.0)
+        return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
+
+
+CONTROL_TENSOR_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "CONTROL_TENSOR_NAME_PATTERNS",
+        "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights",
+    ).split(",")
+    if pattern
+)
+INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS",
+        ",".join(CONTROL_TENSOR_NAME_PATTERNS),
+    ).split(",")
+    if pattern
+)
+
+
+def token_chunks(total_tokens: int, seq_len: int, max_chunk_tokens: int) -> list[int]:
+    usable_total = (total_tokens // seq_len) * seq_len
+    if usable_total <= 0:
+        raise ValueError(f"token budget too small for seq_len={seq_len}")
+    usable_chunk = max((max_chunk_tokens // seq_len) * seq_len, seq_len)
+    chunks: list[int] = []
+    remaining = usable_total
+    while remaining > 0:
+        chunk = min(remaining, usable_chunk)
+        chunks.append(chunk)
+        remaining -= chunk
+    return chunks
+
+
+def accumulate_flat_grads(
+    accum: dict[str, mx.array] | None,
+    grads_tree: dict,
+    scale: float,
+) -> dict[str, mx.array]:
+    flat = dict(tree_flatten(grads_tree))
+    if accum is None:
+        return {k: g * scale for k, g in flat.items()}
+    for k, g in flat.items():
+        accum[k] = accum[k] + g * scale
+    return accum
+
+
+# ==============================================================================
+# MATH HELPERS
+# ==============================================================================
+
+def rms_norm(x: mx.array, eps: float = 1e-6) -> mx.array:
+    return (x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + eps)).astype(x.dtype)
+
+
+def zeropower_newtonschulz5(g: mx.array, steps: int, eps: float = 1e-7) -> mx.array:
+    # Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration.
+    # Muon uses this to normalize matrix-shaped gradients before applying them.
+    # Background on Muon: https://kellerjordan.github.io/posts/muon/
+    a, b, c = 3.4445, -4.7750, 2.0315
+    x = g.astype(mx.float32)
+    x = x / (mx.sqrt(mx.sum(x * x)) + eps)
+    transposed = x.shape[0] > x.shape[1]
+    if transposed:
+        x = x.T
+    for _ in range(steps):
+        a_mat = x @ x.T
+        b_mat = b * a_mat + c * (a_mat @ a_mat)
+        x = a * x + b_mat @ x
+    if transposed:
+        x = x.T
+    return x.astype(g.dtype)
+
+
+def load_data_shard(path: Path) -> np.ndarray:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    token_bytes = np.dtype("<u2").itemsize
+    header = np.fromfile(path, dtype="<i4", count=256)
+    if header.size != 256 or int(header[0]) != 20240520 or int(header[1]) != 1:
+        raise ValueError(f"Unexpected shard header for {path}")
+    num_tokens = int(header[2])
+    if path.stat().st_size != header_bytes + num_tokens * token_bytes:
+        raise ValueError(f"Shard size mismatch for {path}")
+    tokens = np.fromfile(path, dtype="<u2", count=num_tokens, offset=header_bytes)
+    if tokens.size != num_tokens:
+        raise ValueError(f"Short read for {path}")
+    return tokens.astype(np.int32, copy=False)
+
+
+# ==============================================================================
+# TOKEN STREAMING / BATCHING
+# ==============================================================================
+
+
+class TokenStream:
+    def __init__(
+        self,
+        pattern: str,
+        log_fn: Callable[[str], None] | None = None,
+        dataset_name: str = "",
+    ):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.epoch = 1
+        self.file_idx = 0
+        self.log_fn = log_fn
+        self.dataset_name = dataset_name
+        self.tokens = load_data_shard(self.files[0])
+        self.pos = 0
+
+    def next_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        if self.file_idx == 0:
+            self.epoch += 1
+            if self.log_fn is not None:
+                self.log_fn(
+                    f"WARNING: starting epoch:{self.epoch} "
+                    f"dataset:{self.dataset_name} train_shards:{len(self.files)}"
+                )
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> np.ndarray:
+        chunks: list[np.ndarray] = []
+        left = n
+        while left > 0:
+            if self.pos >= self.tokens.size:
+                self.next_file()
+            k = min(left, int(self.tokens.size - self.pos))
+            chunks.append(self.tokens[self.pos : self.pos + k])
+            self.pos += k
+            left -= k
+        return chunks[0] if len(chunks) == 1 else np.concatenate(chunks, axis=0)
+
+
+class TokenLoader:
+    def __init__(
+        self,
+        pattern: str,
+        log_fn: Callable[[str], None] | None = None,
+        dataset_name: str = "",
+    ):
+        self.stream = TokenStream(pattern, log_fn=log_fn, dataset_name=dataset_name)
+
+    def next_batch(self, batch_tokens: int, seq_len: int) -> tuple[mx.array, mx.array]:
+        usable = (batch_tokens // seq_len) * seq_len
+        if usable <= 0:
+            raise ValueError(f"token budget too small for seq_len={seq_len}")
+        chunk = self.stream.take(usable + 1)
+        x = chunk[:-1].reshape(-1, seq_len)
+        y = chunk[1:].reshape(-1, seq_len)
+        return mx.array(x, dtype=mx.int32), mx.array(y, dtype=mx.int32)
+
+
+# ==============================================================================
+# MODEL BLOCKS
+# ==============================================================================
+
+class CastedLinear(nn.Module):
+    def __init__(self, in_dim: int, out_dim: int):
+        super().__init__()
+        self.weight = nn.Linear(in_dim, out_dim, bias=False).weight.astype(mx.float32)
+
+    def __call__(self, x: mx.array) -> mx.array:
+        return x @ self.weight.astype(x.dtype).T
+
+
+class RMSNormNoWeight(nn.Module):
+    # MLX module wrapper around the functional RMSNorm helper so it composes nicely in blocks.
+    def __call__(self, x: mx.array) -> mx.array:
+        return rms_norm(x)
+
+
+class CausalSelfAttention(nn.Module):
+    # - separate q/k/v projections
+    # - RMSNorm on q and k before attention
+    # - RoPE on q and k
+    # - causal masked SDPA
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        if dim % num_heads != 0:
+            raise ValueError("model_dim must be divisible by num_heads")
+        if num_heads % num_kv_heads != 0:
+            raise ValueError("num_heads must be divisible by num_kv_heads")
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = dim // num_heads
+        if self.head_dim % 2 != 0:
+            raise ValueError("head_dim must be even for RoPE")
+        kv_dim = self.num_kv_heads * self.head_dim
+        self.c_q = CastedLinear(dim, dim)
+        self.c_k = CastedLinear(dim, kv_dim)
+        self.c_v = CastedLinear(dim, kv_dim)
+        self.proj = CastedLinear(dim, dim)
+        self.q_gain = mx.ones((num_heads,), dtype=mx.float32) * qk_gain_init
+        self.rope = nn.RoPE(self.head_dim, traditional=False, base=rope_base)
+        self.scale = self.head_dim ** -0.5
+
+    def __call__(self, x: mx.array) -> mx.array:
+        bsz, seqlen, dim = x.shape
+        q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
+        k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(0, 2, 1, 3)
+        v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(0, 2, 1, 3)
+
+        q = self.rope(rms_norm(q).astype(COMPUTE_DTYPE))
+        k = self.rope(rms_norm(k).astype(COMPUTE_DTYPE))
+        q = q * self.q_gain.astype(q.dtype)[None, :, None, None]
+        y = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale, mask="causal")
+        y = y.transpose(0, 2, 1, 3).reshape(bsz, seqlen, dim)
+        return self.proj(y)
+
+
+class MLP(nn.Module):
+    # Baseline MLP uses relu^2 instead of GELU/SiLU. It is cheap and works well in this setup.
+    def __init__(self, dim: int, mlp_mult: int):
+        super().__init__()
+        hidden = dim * mlp_mult
+        self.fc = CastedLinear(dim, hidden)
+        self.proj = CastedLinear(hidden, dim)
+
+    def __call__(self, x: mx.array) -> mx.array:
+        x = nn.relu(self.fc(x))
+        return self.proj(x * x)
+
+
+class Block(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        rope_base: float,
+        qk_gain_init: float,
+    ):
+        super().__init__()
+        self.attn_norm = RMSNormNoWeight()
+        self.mlp_norm = RMSNormNoWeight()
+        self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init)
+        self.mlp = MLP(dim, mlp_mult)
+        self.attn_scale = mx.ones((dim,), dtype=mx.float32)
+        self.mlp_scale = mx.ones((dim,), dtype=mx.float32)
+        self.resid_mix = mx.array(np.stack((np.ones((dim,), dtype=np.float32), np.zeros((dim,), dtype=np.float32))))
+
+    def __call__(self, x: mx.array, x0: mx.array) -> mx.array:
+        mix = self.resid_mix.astype(x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x))
+        x = x + self.attn_scale.astype(x.dtype)[None, None, :] * attn_out
+        x = x + self.mlp_scale.astype(x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x))
+        return x
+
+
+class GPT(nn.Module):
+    # - token embedding + RMSNorm
+    # - encoder half accumulates skip tensors
+    # - decoder half consumes reversed skips with learned skip_weights
+    # - tied embeddings for the LM head (the baseline default setup)
+    def __init__(self, vocab_size: int, num_layers: int, dim: int, num_heads: int, num_kv_heads: int, mlp_mult: int,
+                 logit_chunk_tokens: int, logit_softcap: float, rope_base: float, tied_embed_init_std: float,
+                 qk_gain_init: float):
+        super().__init__()
+        if logit_softcap <= 0.0:
+            raise ValueError(f"logit_softcap must be positive, got {logit_softcap}")
+        self.logit_chunk_tokens = logit_chunk_tokens
+        self.logit_softcap = logit_softcap
+
+        self.tok_emb = nn.Embedding(vocab_size, dim)
+        self.num_encoder_layers = num_layers // 2
+        self.num_decoder_layers = num_layers - self.num_encoder_layers
+        self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+        self.skip_weights = mx.ones((self.num_skip_weights, dim), dtype=mx.float32)
+        self.blocks = [
+            Block(dim, num_heads, num_kv_heads, mlp_mult, rope_base, qk_gain_init)
+            for i in range(num_layers)
+        ]
+        self.final_norm = RMSNormNoWeight()
+
+        for b in self.blocks:
+            b.attn.proj.weight = mx.zeros_like(b.attn.proj.weight)
+            b.mlp.proj.weight = mx.zeros_like(b.mlp.proj.weight)
+        self.tok_emb.weight = (
+            mx.random.normal(self.tok_emb.weight.shape, dtype=mx.float32) * tied_embed_init_std
+        ).astype(COMPUTE_DTYPE)
+
+    def softcap(self, logits: mx.array) -> mx.array:
+        c = self.logit_softcap
+        return c * mx.tanh(logits / c)
+
+    def __call__(self, input_ids: mx.array) -> mx.array:
+        x = rms_norm(self.tok_emb(input_ids).astype(COMPUTE_DTYPE))
+        x0 = x
+        skips: list[mx.array] = []
+
+        for i in range(self.num_encoder_layers):
+            x = self.blocks[i](x, x0)
+            skips.append(x)
+        for i in range(self.num_decoder_layers):
+            # Odd layer counts have one more decoder block than encoder block. The baseline only
+            # applies a skip connection when one exists, then runs the remaining decoder block(s)
+            # without an added skip.
+            if skips:
+                x = x + self.skip_weights[i].astype(x.dtype)[None, None, :] * skips.pop()
+            x = self.blocks[self.num_encoder_layers + i](x, x0)
+        return self.final_norm(x)
+
+    def loss(self, input_ids: mx.array, target_ids: mx.array) -> mx.array:
+        # Cross-entropy over flattened tokens. We keep optional logit chunking because it is a useful
+        # memory knob on Macs, but the common path is chunk_tokens=0 (single matmul + CE).
+        x = self(input_ids).reshape(-1, self.tok_emb.weight.shape[1])
+        y = target_ids.reshape(-1)
+        if self.logit_chunk_tokens <= 0 or x.shape[0] <= self.logit_chunk_tokens:
+            logits_proj = x @ self.tok_emb.weight.astype(x.dtype).T
+            logits = self.softcap(logits_proj)
+            return nn.losses.cross_entropy(logits.astype(mx.float32), y, reduction="mean")
+
+        loss_sum = mx.array(0.0, dtype=mx.float32)
+        n = int(x.shape[0])
+        for s in range(0, n, self.logit_chunk_tokens):
+            e = min(s + self.logit_chunk_tokens, n)
+            logits_proj = x[s:e] @ self.tok_emb.weight.astype(x.dtype).T
+            logits = self.softcap(logits_proj)
+            loss_sum = loss_sum + nn.losses.cross_entropy(logits.astype(mx.float32), y[s:e], reduction="sum")
+        return loss_sum / float(n)
+
+# ==============================================================================
+# OPTIMIZERS (MUON + ADAM SPLIT)
+# ==============================================================================
+class Muon:
+    # Muon applies SGD-momentum to matrix gradients, then orthogonalizes the result before the
+    # parameter update.
+    def __init__(self, keys: list[str], params: dict[str, mx.array], args: Hyperparameters):
+        self.keys = keys
+        self.args = args
+        self.buffers = {k: mx.zeros_like(params[k]) for k in keys}
+
+    def step(self, params: dict[str, mx.array], grads: dict[str, mx.array], step: int, lr_mul: float) -> dict[str, mx.array]:
+        if self.args.muon_momentum_warmup_steps:
+            t = min(step / self.args.muon_momentum_warmup_steps, 1.0)
+            momentum = (1.0 - t) * self.args.muon_momentum_warmup_start + t * self.args.muon_momentum
+        else:
+            momentum = self.args.muon_momentum
+        lr = self.args.matrix_lr * lr_mul
+        out: dict[str, mx.array] = {}
+        for k in self.keys:
+            p = params[k]
+            g = grads[k]
+            buf = momentum * self.buffers[k] + g
+            self.buffers[k] = buf
+            g_eff = g + momentum * buf
+            g_ortho = zeropower_newtonschulz5(g_eff, self.args.muon_backend_steps)
+            scale = math.sqrt(max(1.0, float(p.shape[0]) / float(p.shape[1])))
+            out[k] = p - lr * (g_ortho * scale).astype(p.dtype)
+        return out
+
+
+class SplitOptimizers:
+    # - embeddings: Adam with the tied-embedding LR
+    # - block matrices (2D): Muon
+    # - block scalars + skip weights: Adam
+    # This preserves the high-level optimization behavior even though MLX internals differ.
+    def __init__(self, model: GPT, args: Hyperparameters):
+        self.args = args
+        params = dict(tree_flatten(model.parameters()))
+        self.embed_key = "tok_emb.weight"
+        self.matrix_keys = [
+            k
+            for k, p in params.items()
+            if k.startswith("blocks.") and p.ndim == 2 and not any(pattern in k for pattern in CONTROL_TENSOR_NAME_PATTERNS)
+        ]
+        self.scalar_keys = [
+            k
+            for k, p in params.items()
+            if k == "skip_weights" or (k.startswith("blocks.") and (p.ndim < 2 or any(pattern in k for pattern in CONTROL_TENSOR_NAME_PATTERNS)))
+        ]
+
+        self.muon = Muon(self.matrix_keys, params, args)
+        self.adam_embed = optim.Adam(
+            learning_rate=args.tied_embed_lr,
+            betas=[args.beta1, args.beta2],
+            eps=args.adam_eps,
+            bias_correction=True,
+        )
+        self.adam_scalar = optim.Adam(
+            learning_rate=args.scalar_lr,
+            betas=[args.beta1, args.beta2],
+            eps=args.adam_eps,
+            bias_correction=True,
+        )
+
+    def step(self, model: GPT, grads_tree: dict, step: int, lr_mul: float) -> None:
+        params = dict(tree_flatten(model.parameters()))
+        grads = dict(tree_flatten(grads_tree))
+        updated = dict(params)
+
+        updated.update(self.muon.step(params, grads, step=step, lr_mul=lr_mul))
+
+        self.adam_embed.learning_rate = self.args.tied_embed_lr * lr_mul
+        updated.update(
+            self.adam_embed.apply_gradients(
+                {self.embed_key: grads[self.embed_key]},
+                {self.embed_key: params[self.embed_key]},
+            )
+        )
+
+        self.adam_scalar.learning_rate = self.args.scalar_lr * lr_mul
+        scalar_grads = {k: grads[k] for k in self.scalar_keys}
+        scalar_params = {k: params[k] for k in self.scalar_keys}
+        updated.update(self.adam_scalar.apply_gradients(scalar_grads, scalar_params))
+
+        model.update(tree_unflatten(list(updated.items())))
+
+# ==============================================================================
+# QUANTIZATION (INT8 + ZLIB)
+# ==============================================================================
+# - per-row int8 for 2D float tensors
+# - per-tensor int8 for other float tensors
+# - fp16 passthrough for small float tensors
+# - exact passthrough for non-floats
+
+MX_DTYPE_FROM_NAME = {
+    "float32": mx.float32,
+    "float16": mx.float16,
+    "bfloat16": mx.bfloat16,
+}
+
+INT8_KEEP_FLOAT_MAX_NUMEL = 65_536
+INT8_KEEP_FLOAT_STORE_DTYPE = np.float16
+INT8_PER_ROW_SCALE_DTYPE = np.float16
+INT8_CLIP_PERCENTILE = 99.99984
+INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0
+
+
+def _np_float32(arr: mx.array) -> np.ndarray:
+    return np.array(arr.astype(mx.float32), dtype=np.float32, copy=False)
+
+
+def keep_float_array(name: str, arr: mx.array, passthrough_orig_dtypes: dict[str, str]) -> np.ndarray:
+    if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS):
+        return np.ascontiguousarray(_np_float32(arr))
+    if arr.dtype in {mx.float32, mx.bfloat16}:
+        passthrough_orig_dtypes[name] = str(arr.dtype).split(".")[-1]
+        return np.ascontiguousarray(np.array(arr.astype(mx.float16), dtype=INT8_KEEP_FLOAT_STORE_DTYPE, copy=False))
+    return np.ascontiguousarray(np.array(arr, copy=True))
+
+
+def quantize_float_array(arr: mx.array) -> tuple[np.ndarray, np.ndarray]:
+    f32 = _np_float32(arr)
+    if f32.ndim == 2:
+        # Matrices get one scale per row, which usually tracks output-channel
+        # ranges much better than a single tensor-wide scale.
+        clip_abs = np.quantile(np.abs(f32), INT8_CLIP_Q, axis=1) if f32.size else np.empty((f32.shape[0],), dtype=np.float32)
+        clipped = np.clip(f32, -clip_abs[:, None], clip_abs[:, None])
+        scale = np.maximum(clip_abs / 127.0, 1.0 / 127.0).astype(np.float32, copy=False)
+        q = np.clip(np.round(clipped / scale[:, None]), -127, 127).astype(np.int8, copy=False)
+        return np.ascontiguousarray(q), np.ascontiguousarray(scale.astype(INT8_PER_ROW_SCALE_DTYPE, copy=False))
+
+    # Vectors / scalars use a simpler per-tensor scale.
+    clip_abs = float(np.quantile(np.abs(f32).reshape(-1), INT8_CLIP_Q)) if f32.size else 0.0
+    scale = np.array(clip_abs / 127.0 if clip_abs > 0.0 else 1.0, dtype=np.float32)
+    q = np.clip(np.round(np.clip(f32, -clip_abs, clip_abs) / scale), -127, 127).astype(np.int8, copy=False)
+    return np.ascontiguousarray(q), scale
+
+
+def quantize_state_dict_int8(flat_state: dict[str, mx.array]) -> tuple[dict[str, object], dict[str, int]]:
+    quantized: dict[str, np.ndarray] = {}
+    scales: dict[str, np.ndarray] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, np.ndarray] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+    for name, arr in flat_state.items():
+        stats["param_count"] += int(arr.size)
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += int(arr.nbytes)
+        if not mx.issubdtype(arr.dtype, mx.floating):
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = np.ascontiguousarray(np.array(arr))
+            stats["int8_payload_bytes"] += int(passthrough[name].nbytes)
+            continue
+
+        # Small float tensors are cheap enough to keep directly. We still downcast
+        # fp32/bf16 passthrough tensors to fp16 so metadata does not dominate size.
+        if int(arr.size) <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_array(name, arr, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += int(kept.nbytes)
+            continue
+
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_array(arr)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(arr.dtype).split(".")[-1]
+        stats["int8_payload_bytes"] += int(q.nbytes + s.nbytes)
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+
+
+def dequantize_state_dict_int8(quant_obj: dict[str, object]) -> dict[str, mx.array]:
+    out: dict[str, mx.array] = {}
+    qmeta = quant_obj.get("qmeta", {})
+    passthrough_orig_dtypes = quant_obj.get("passthrough_orig_dtypes", {})
+    for name, q in quant_obj["quantized"].items():
+        q_np = np.asarray(q, dtype=np.int8)
+        dtype_name = quant_obj["dtypes"][name]
+        scale = np.asarray(quant_obj["scales"][name], dtype=np.float32)
+        if qmeta.get(name, {}).get("scheme") == "per_row" or scale.ndim > 0:
+            # Broadcast the saved row scale back across trailing dimensions.
+            out_arr = q_np.astype(np.float32) * scale.reshape((q_np.shape[0],) + (1,) * (q_np.ndim - 1))
+        else:
+            out_arr = q_np.astype(np.float32) * float(scale)
+        out[name] = mx.array(out_arr, dtype=MX_DTYPE_FROM_NAME[dtype_name])
+    for name, arr in quant_obj["passthrough"].items():
+        # Restore small tensors, undoing the temporary fp16 storage cast if needed.
+        out_arr = np.array(arr, copy=True)
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out[name] = mx.array(out_arr, dtype=MX_DTYPE_FROM_NAME[orig_dtype])
+        else:
+            out[name] = mx.array(out_arr)
+    return out
+
+
+def build_sentencepiece_luts(
+    sp: spm.SentencePieceProcessor, vocab_size: int
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    sp_vocab_size = int(sp.vocab_size())
+    table_size = max(sp_vocab_size, vocab_size)
+    base_bytes_lut = np.zeros((table_size,), dtype=np.int16)
+    has_leading_space_lut = np.zeros((table_size,), dtype=np.bool_)
+    is_boundary_token_lut = np.ones((table_size,), dtype=np.bool_)
+    for token_id in range(sp_vocab_size):
+        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+            continue
+        is_boundary_token_lut[token_id] = False
+        if sp.is_byte(token_id):
+            base_bytes_lut[token_id] = 1
+            continue
+        piece = sp.id_to_piece(token_id)
+        if piece.startswith("▁"):
+            has_leading_space_lut[token_id] = True
+            piece = piece[1:]
+        base_bytes_lut[token_id] = len(piece.encode("utf-8"))
+    return base_bytes_lut, has_leading_space_lut, is_boundary_token_lut
+
+
+def validate_dataset_tokenizer_pair(data_path: str, tokenizer_path: str) -> tuple[str, int, int | None]:
+    # The shard directory and tokenizer are coupled: val_bpb is only meaningful if we
+    # decode bytes with the exact tokenizer that produced the shards. The manifest
+    # lets the training script fail fast on accidental dataset/tokenizer mismatches.
+    dataset_dir = Path(data_path).resolve()
+    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
+    if len(dataset_dir.parents) < 2:
+        return dataset_dir.name, actual_train_files, None
+    manifest_path = dataset_dir.parents[1] / "manifest.json"
+    if not manifest_path.is_file():
+        return dataset_dir.name, actual_train_files, None
+
+    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
+    dataset_entry = next((x for x in manifest.get("datasets", []) if x.get("name") == dataset_dir.name), None)
+    if dataset_entry is None:
+        return dataset_dir.name, actual_train_files, None
+
+    tokenizer_name = dataset_entry.get("tokenizer_name")
+    tokenizer_entry = (
+        next((x for x in manifest.get("tokenizers", []) if x.get("name") == tokenizer_name), None)
+        if tokenizer_name
+        else None
+    )
+    expected_name = Path((tokenizer_entry or {}).get("model_path") or (tokenizer_entry or {}).get("path") or "").name
+    if expected_name and Path(tokenizer_path).name != expected_name:
+        raise ValueError(f"{dataset_dir.name} expects tokenizer {expected_name}, got {Path(tokenizer_path).name}")
+    expected_train_files = (dataset_entry.get("stats") or {}).get("files_train")
+    if expected_train_files is not None:
+        expected_train_files = int(expected_train_files)
+        if actual_train_files > expected_train_files:
+            raise ValueError(
+                f"{dataset_dir.name} has more train shards than expected: found {actual_train_files}, "
+                f"manifest says {expected_train_files}"
+            )
+    return dataset_dir.name, actual_train_files, expected_train_files
+
+
+def load_validation_tokens(pattern: str, seq_len: int) -> np.ndarray:
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*.
+    tokens = np.ascontiguousarray(np.concatenate([load_data_shard(file) for file in files], axis=0))
+    usable = ((tokens.size - 1) // seq_len) * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def loss_and_grad_chunked(
+    args: Hyperparameters,
+    train_loader: TokenLoader,
+    compiled_loss_and_grad,
+) -> tuple[mx.array, dict]:
+    chunk_sizes = token_chunks(args.microbatch_tokens, args.train_seq_len, args.mlx_max_microbatch_tokens)
+    total_tokens = float(sum(chunk_sizes))
+    loss_value = mx.array(0.0, dtype=mx.float32)
+    grad_accum: dict[str, mx.array] | None = None
+    for chunk_tokens in chunk_sizes:
+        x, y = train_loader.next_batch(chunk_tokens, args.train_seq_len)
+        loss, grads = compiled_loss_and_grad(x, y)
+        scale = float(y.size) / total_tokens
+        loss_value = loss_value + loss.astype(mx.float32) * scale
+        grad_accum = accumulate_flat_grads(grad_accum, grads, scale)
+    return loss_value, tree_unflatten(list(grad_accum.items()))
+
+
+def eval_val(
+    args: Hyperparameters,
+    compiled_loss,
+    val_tokens: np.ndarray,
+    base_bytes_lut: np.ndarray,
+    has_leading_space_lut: np.ndarray,
+    is_boundary_token_lut: np.ndarray,
+) -> tuple[float, float]:
+    # Validation computes two metrics:
+    # - val_loss: token cross-entropy (natural log)
+    # - val_bpb: tokenizer-agnostic compression metric used by the challenge
+    val_batch_tokens = args.val_batch_size // args.grad_accum_steps
+    if val_batch_tokens < args.train_seq_len:
+        raise ValueError(
+            "VAL_BATCH_SIZE must provide at least one sequence; "
+            f"got VAL_BATCH_SIZE={args.val_batch_size}, GRAD_ACCUM_STEPS={args.grad_accum_steps}, "
+            f"TRAIN_SEQ_LEN={args.train_seq_len}"
+        )
+    val_batch_seqs = val_batch_tokens // args.train_seq_len
+    total_seqs = (val_tokens.size - 1) // args.train_seq_len
+    total_loss = mx.array(0.0, dtype=mx.float32)
+    total_tokens = 0.0
+    total_bytes = 0.0
+    for batch_seq_start in range(0, total_seqs, val_batch_seqs):
+        batch_seq_end = min(batch_seq_start + val_batch_seqs, total_seqs)
+        raw_start = batch_seq_start * args.train_seq_len
+        raw_end = batch_seq_end * args.train_seq_len + 1
+        chunk = val_tokens[raw_start:raw_end]
+        x_np = chunk[:-1].reshape(-1, args.train_seq_len)
+        y_np = chunk[1:].reshape(-1, args.train_seq_len)
+        x = mx.array(x_np, dtype=mx.int32)
+        y = mx.array(y_np, dtype=mx.int32)
+        chunk_token_count = float(y.size)
+        total_loss = total_loss + compiled_loss(x, y).astype(mx.float32) * chunk_token_count
+        prev_ids = x_np.reshape(-1)
+        tgt_ids = y_np.reshape(-1)
+        bytes_np = base_bytes_lut[tgt_ids].astype(np.int16, copy=True)
+        bytes_np += (
+            has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]
+        ).astype(np.int16, copy=False)
+        total_tokens += chunk_token_count
+        total_bytes += float(bytes_np.astype(np.float64).sum())
+    total_loss = total_loss / total_tokens
+    mx.eval(total_loss)
+    val_loss = float(total_loss.item())
+    bits_per_token = val_loss / math.log(2.0)
+    val_bpb = bits_per_token * (total_tokens / total_bytes)
+    return val_loss, val_bpb
+
+# -----------------------------
+# TRAINING
+# -----------------------------
+
+def clip_grad_tree(grads_tree: dict, max_norm: float) -> dict:
+    if max_norm <= 0:
+        return grads_tree
+    flat = dict(tree_flatten(grads_tree))
+    total_sq = 0.0
+    for grad in flat.values():
+        total_sq += float(np.sum(np.square(_np_float32(grad)), dtype=np.float64))
+    if total_sq <= 0.0:
+        return grads_tree
+    total_norm = math.sqrt(total_sq)
+    if total_norm <= max_norm:
+        return grads_tree
+    scale = max_norm / (total_norm + 1e-12)
+    return tree_unflatten([(k, g * scale) for k, g in flat.items()])
+
+
+def main() -> None:
+    # ==============================================================================
+    # TOKENIZER + VALIDATION METRIC SETUP
+    # ==============================================================================
+    args = Hyperparameters()
+    out_dir = Path(args.out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    logfile = out_dir / f"{args.run_id}.txt"
+    print(logfile)
+
+    def log(msg: str, console: bool = True) -> None:
+        if console:
+            print(msg)
+        with logfile.open("a", encoding="utf-8") as f:
+            print(msg, file=f)
+
+    code = Path(__file__).read_text(encoding="utf-8")
+    log(code, console=False)
+    log("=" * 100, console=False)
+    log(f"Running Python {sys.version}", console=False)
+    log(f"Running MLX {mx.__version__}", console=False)
+    log("=" * 100, console=False)
+
+    if not args.tie_embeddings:
+        raise NotImplementedError("train_gpt_mlx.py only supports tied embeddings")
+    if not args.tokenizer_path.endswith(".model"):
+        raise ValueError(f"TOKENIZER_PATH must point to a SentencePiece .model file: {args.tokenizer_path}")
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    if int(sp.vocab_size()) != args.vocab_size:
+        raise ValueError(
+            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
+        )
+    dataset_name, actual_train_files, expected_train_files = validate_dataset_tokenizer_pair(
+        args.data_path,
+        args.tokenizer_path,
+    )
+    val_tokens = load_validation_tokens(args.val_files, args.train_seq_len)
+
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
+        sp, args.vocab_size
+    )
+
+    # ==============================================================================
+    # TRAINING SETUP
+    # ==============================================================================
+    mx.random.seed(args.seed)
+
+    train_loader = TokenLoader(args.train_files, log_fn=log, dataset_name=dataset_name)
+
+    # ==============================================================================
+    # MODEL + OPTIMIZER SETUP
+    # ==============================================================================
+    model = GPT(
+        vocab_size=args.vocab_size,
+        num_layers=args.num_layers,
+        dim=args.model_dim,
+        num_heads=args.num_heads,
+        num_kv_heads=args.num_kv_heads,
+        mlp_mult=args.mlp_mult,
+        logit_chunk_tokens=args.logit_chunk_tokens,
+        logit_softcap=args.logit_softcap,
+        rope_base=args.rope_base,
+        tied_embed_init_std=args.tied_embed_init_std,
+        qk_gain_init=args.qk_gain_init,
+    )
+    opt = SplitOptimizers(model, args)
+
+    # ==============================================================================
+    # COMPILED TRAIN / EVAL FUNCTIONS (MLX)
+    # ==============================================================================
+    # The crucial MLX detail is capture scope: this model contains non-trainable arrays too (for example
+    # inside RoPE modules), so compiling only against trainable parameters throws "uncaptured inputs".
+    # Compiling the model-bound functions and capturing the full model state fixes that while still
+    # returning gradients only for trainable parameters via nn.value_and_grad(...).
+    compiled_loss = mx.compile(lambda x, y: model.loss(x, y), inputs=model.state, outputs=model.state)
+    compiled_loss_and_grad = mx.compile(
+        nn.value_and_grad(model, lambda x, y: model.loss(x, y)),
+        inputs=model.state,
+        outputs=model.state,
+    )
+
+    # Print config once so logs are self-describing.
+    n_params = sum(int(np.prod(p.shape)) for _, p in tree_flatten(model.parameters()))
+    log(f"run_id:{args.run_id}")
+    log(f"mlx_version:{mx.__version__}")
+    log(f"train_loader:shards pattern={args.train_files}")
+    log(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.size - 1}")
+    if expected_train_files is None:
+        log(f"train_loader:dataset:{dataset_name} train_shards:{actual_train_files}")
+    elif actual_train_files < expected_train_files:
+        log(
+            f"WARNING: train_loader:subset dataset:{dataset_name} "
+            f"train_shards:{actual_train_files}/{expected_train_files} "
+            f"new epochs will arrive sooner than the full dataset"
+        )
+    else:
+        log(f"train_loader:dataset:{dataset_name} train_shards:{actual_train_files}/{expected_train_files}")
+    log(f"tokenizer_path:{args.tokenizer_path}")
+    log(
+        f"model_params:{n_params} vocab_size:{args.vocab_size} layers:{args.num_layers} "
+        f"dim:{args.model_dim} heads:{args.num_heads} kv_heads:{args.num_kv_heads} "
+        f"seq_len:{args.train_seq_len} tie_embeddings:{args.tie_embeddings}"
+    )
+    log(
+        f"iterations:{args.iterations} train_batch_tokens:{args.train_batch_tokens} grad_accum_steps:{args.grad_accum_steps} "
+        f"microbatch_tokens:{args.microbatch_tokens} microbatch_batch_size:{args.microbatch_tokens // args.train_seq_len} "
+        f"val_batch_size:{args.val_batch_size} "
+        f"warmup_steps:{args.warmup_steps} max_wallclock_seconds:{args.max_wallclock_seconds:.3f}"
+    )
+    log(f"mlx_max_microbatch_tokens:{args.mlx_max_microbatch_tokens}")
+    log(
+        f"optimizer:muon+adam muon_matrix_params:{len(opt.matrix_keys)} scalar_params:{len(opt.scalar_keys)} "
+        f"embed_lr:{args.tied_embed_lr} "
+        f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr} "
+        f"muon_momentum:{args.muon_momentum} muon_steps:{args.muon_backend_steps}"
+    )
+    log(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
+    log(f"compute_dtype:{COMPUTE_DTYPE} compile:True")
+    log(
+        f"dtypes tok_emb:{model.tok_emb.weight.dtype} "
+        f"linear_weight:{model.blocks[0].attn.c_q.weight.dtype} "
+        f"skip_weights:{model.skip_weights.dtype}"
+    )
+
+    # ==============================================================================
+    # TRAINING LOOP
+    # ==============================================================================
+    if args.warmup_steps > 0:
+        # Warmup should only prime MLX compile/allocation paths. Updating parameters here forces us
+        # to snapshot and restore model/optimizer state, which is expensive on unified-memory Macs.
+        # Instead we run the real train shapes, force the loss/grads to materialize, and then reset
+        # the loader so measured training still starts from the true init and token window.
+        for warmup_step in range(args.warmup_steps):
+            accum: dict[str, mx.array] | None = None
+            warmup_loss = mx.array(0.0, dtype=mx.float32)
+            grad_scale = 1.0 / args.grad_accum_steps
+            for _ in range(args.grad_accum_steps):
+                warmup_loss, grads = loss_and_grad_chunked(args, train_loader, compiled_loss_and_grad)
+                accum = accumulate_flat_grads(accum, grads, grad_scale)
+            mx.eval(warmup_loss, accum)
+            mx.synchronize()
+            if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps:
+                log(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}")
+
+        # Prime the standalone eval graph once too. It is compiled separately from value_and_grad.
+        val_batch_tokens = args.val_batch_size // args.grad_accum_steps
+        if val_batch_tokens < args.train_seq_len:
+            raise ValueError(
+                "VAL_BATCH_SIZE must provide at least one sequence; "
+                f"got VAL_BATCH_SIZE={args.val_batch_size}, GRAD_ACCUM_STEPS={args.grad_accum_steps}, "
+                f"TRAIN_SEQ_LEN={args.train_seq_len}"
+            )
+        warm_val_seqs = min(val_batch_tokens // args.train_seq_len, (val_tokens.size - 1) // args.train_seq_len)
+        warm_chunk = val_tokens[: warm_val_seqs * args.train_seq_len + 1]
+        x_val = mx.array(warm_chunk[:-1].reshape(-1, args.train_seq_len), dtype=mx.int32)
+        y_val = mx.array(warm_chunk[1:].reshape(-1, args.train_seq_len), dtype=mx.int32)
+        warm_val_loss = compiled_loss(x_val, y_val)
+        mx.eval(warm_val_loss)
+        mx.synchronize()
+
+        train_loader = TokenLoader(args.train_files, log_fn=log, dataset_name=dataset_name)
+
+    train_time_ms = 0.0
+    max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
+    stop_after_step: int | None = None
+    t0 = time.perf_counter()
+    step = 0
+    while True:
+        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+        if last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0):
+            # Validation always scans the same fixed full validation split.
+            val_loss, val_bpb = eval_val(
+                args,
+                compiled_loss,
+                val_tokens,
+                base_bytes_lut,
+                has_leading_space_lut,
+                is_boundary_token_lut,
+            )
+            train_time_ms += 1000.0 * (time.perf_counter() - t0)
+            if step % 25 == 0 or last_step:
+                log(
+                    f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+                    f"train_time:{train_time_ms:.0f}ms step_avg:{train_time_ms / max(step, 1):.2f}ms"
+                )
+            t0 = time.perf_counter()
+        if last_step:
+            if stop_after_step is not None and step < args.iterations:
+                log(f"stopping_early: wallclock_cap train_time:{train_time_ms:.0f}ms step:{step}/{args.iterations}")
+            break
+
+        lr_mul = args.lr_mul(step, train_time_ms + 1000.0 * (time.perf_counter() - t0))
+        step_t0 = time.perf_counter()
+
+        accum: dict[str, mx.array] | None = None
+        train_loss = mx.array(0.0, dtype=mx.float32)
+        grad_scale = 1.0 / args.grad_accum_steps
+        for _ in range(args.grad_accum_steps):
+            loss, grads = loss_and_grad_chunked(args, train_loader, compiled_loss_and_grad)
+            accum = accumulate_flat_grads(accum, grads, grad_scale)
+            train_loss = train_loss + loss.astype(mx.float32) * grad_scale
+
+        grads = tree_unflatten(list(accum.items()))
+        grads = clip_grad_tree(grads, args.grad_clip_norm)
+        train_loss_value = float(train_loss.item())
+        opt.step(model, grads, step=step, lr_mul=lr_mul)
+        mx.synchronize()
+
+        step_ms = 1000.0 * (time.perf_counter() - step_t0)
+        approx_train_time_ms = train_time_ms + 1000.0 * (time.perf_counter() - t0)
+        tok_s = args.train_batch_tokens / (step_ms / 1000.0)
+        step += 1
+        if args.train_log_every > 0 and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None):
+            log(
+                f"step:{step}/{args.iterations} train_loss:{train_loss_value:.4f} "
+                f"train_time:{approx_train_time_ms:.0f}ms step_avg:{approx_train_time_ms / step:.2f}ms tok_s:{tok_s:.0f}"
+            )
+        if max_wallclock_ms is not None and stop_after_step is None and approx_train_time_ms >= max_wallclock_ms:
+            stop_after_step = step
+
+    # ==============================================================================
+    # FINAL SERIALIZATION + QUANTIZED ROUNDTRIP EVAL
+    # ==============================================================================
+    # We always write a raw artifact and a quantized artifact, then validate the
+    # quantized roundtrip directly by loading the dequantized tensors back into the
+    # model and running one final validation pass.
+    out_path = out_dir / f"{args.run_id}_mlx_model.npz"
+    flat_state = {k: v for k, v in tree_flatten(model.state)}
+    mx.savez(str(out_path), **flat_state)
+    log(f"saved_model:{out_path} bytes:{out_path.stat().st_size}")
+
+    quant_obj, quant_stats = quantize_state_dict_int8(flat_state)
+    quant_raw = pickle.dumps(quant_obj, protocol=pickle.HIGHEST_PROTOCOL)
+    quant_blob = zlib.compress(quant_raw, level=9)
+    quant_serialized_bytes = len(quant_raw)
+    quant_path = out_dir / f"{args.run_id}_mlx_model.int8.ptz"
+    with quant_path.open("wb") as f:
+        f.write(quant_blob)
+    quant_file_bytes = quant_path.stat().st_size
+    ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
+    log(
+        f"serialized_model_int8_zlib:{quant_file_bytes} bytes "
+        f"(payload:{quant_stats['int8_payload_bytes']} raw_pickle:{quant_serialized_bytes} payload_ratio:{ratio:.2f}x)"
+    )
+
+    with quant_path.open("rb") as f:
+        quant_blob_disk = f.read()
+    quant_flat = dequantize_state_dict_int8(pickle.loads(zlib.decompress(quant_blob_disk)))
+    model.update(tree_unflatten(list(quant_flat.items())))
+    q_t0 = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args,
+        compiled_loss,
+        val_tokens,
+        base_bytes_lut,
+        has_leading_space_lut,
+        is_boundary_token_lut,
+    )
+    q_eval_ms = 1000.0 * (time.perf_counter() - q_t0)
+    log(f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} eval_time:{q_eval_ms:.0f}ms")
+    log(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+
+
+if __name__ == "__main__":
+    main()
author	Will DePue <williamd@openai.com>	2026-03-18 09:32:01 -0700
committer	Will DePue <williamd@openai.com>	2026-03-18 09:32:01 -0700
commit	a15093adad328a650d421e53c078cbd2c45beb0e (patch)
tree	e054c4bde12b89e6d3b39d611d9caadabc7f7234