| Age | Commit message | Author |
|
(val_bpb=1.1748) (#60)
* Add NTK Eval + Overtone Init submission (1.2160 BPB)
Train@1024 with overtone embedding init and phase-transition residual
mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb
1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)
* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)
* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)
* Update submission: 1.2000 BPB — FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone
* Update: 1.1748 BPB — Sliding Window + FP16 Embed + 10L + Muon WD + Overtone
---------
Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
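
For readers unfamiliar with the NTK-aware RoPE scaling named in the first bullet above, here is a minimal sketch of the commonly used form of the trick: the rotary base is enlarged by `scale ** (d / (d - 2))`, where `scale` is the ratio of evaluation length to training length. The function names, the base of 10000 and the head dimension of 64 are illustrative assumptions, not values taken from this submission, and the submission's own implementation may differ.

```python
def ntk_scaled_base(base: float, head_dim: int, train_len: int, eval_len: int) -> float:
    """Return a RoPE base enlarged for evaluation beyond the training length.

    Hypothetical helper: uses the common NTK-aware rule
    base' = base * scale ** (d / (d - 2)), with scale = eval_len / train_len.
    """
    scale = max(eval_len / train_len, 1.0)  # never shrink the base
    return base * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(base: float, head_dim: int) -> list:
    """Per-pair rotation frequencies base ** (-2i / d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


# Assumed setup: train at 1024 tokens, evaluate at 2048 (as in the commit text).
train_freqs = rope_inv_freq(10000.0, 64)
eval_freqs = rope_inv_freq(ntk_scaled_base(10000.0, 64, 1024, 2048), 64)

# The fastest-rotating pair is untouched; the slowest is stretched by ~scale.
print(train_freqs[0] == eval_freqs[0])
print(train_freqs[-1] / eval_freqs[-1])
```

The design point is that high-frequency pairs, which encode local order, keep their training-time behaviour, while the lowest frequency is slowed by roughly the length ratio so that long-range positions stay within the range seen during training.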
|
|
Co-authored-by: spokane-way <spokane@way>
|
|
The window_starts filter dropped windows shorter than stride,
silently skipping up to (stride-1) tokens at the end of the
validation set. Now includes all windows with >= 1 scoreable
token, and clamps the score start for short final windows.
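
The fix described above can be illustrated with a small, self-contained sketch of a sliding-window scoring plan. This is not the repository's code: `plan_windows` and its return format are hypothetical, and the only goal is to show that a window is kept whenever it contributes at least one unscored token, and that the in-window score start is clamped so short final windows are handled.

```python
def plan_windows(total_tokens: int, seq_len: int, stride: int) -> list:
    """Plan sliding-window evaluation so every token is scored exactly once.

    Returns (window_start, window_end, local_score_start) triples. Tokens
    before local_score_start serve only as context for that window.
    Hypothetical helper, assuming 1 <= stride <= seq_len.
    """
    assert 1 <= stride <= seq_len
    plans = []
    scored_upto = 0  # absolute index of the first token not yet scored
    for ws in range(0, total_tokens, stride):
        end = min(ws + seq_len, total_tokens)
        if end <= scored_upto:
            continue  # this window adds no new scoreable token
        # Clamp so a short final window never gets a negative start.
        local_start = max(scored_upto - ws, 0)
        plans.append((ws, end, local_start))
        scored_upto = end
    return plans


def tokens_scored(plans: list) -> int:
    """Count how many tokens a plan scores in total."""
    return sum(end - ws - local for ws, end, local in plans)


# 11 tokens, windows of 4, stride 2: the last window scores a single token.
for plan in plan_windows(11, 4, 2):
    print(plan)
```

Filtering on "window length >= stride" instead of "at least one new token" would drop the final `(8, 11, 2)` window in this example and leave the last token unscored.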
|
* SOTA attempt
* Improve score on SXM
---------
Co-authored-by: spokane-way <spokane@way>
|
|
Keep tok_emb.weight in fp16 during int8 export (kills the quant gap),
shrink MLP hidden to 992 to fit under 16MB, and bump warmdown to 3600
and matrix LR to 0.06.
Tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
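
As a rough illustration of the export change in this commit, the sketch below quantizes tensors to int8 with a per-row scale but passes any tensor named in a keep-list through as fp16. Only the name `tok_emb.weight` comes from the commit message; the helper names, the per-row symmetric scheme and the dictionary layout are assumptions made for the example, and plain Python lists stand in for real tensors.

```python
import struct

KEEP_FP16 = {"tok_emb.weight"}  # tensor name taken from the commit message


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_row_int8(row: list) -> tuple:
    """Symmetric per-row int8 quantization (assumed scheme): q = round(v / scale)."""
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def export_tensor(name: str, rows: list) -> dict:
    """Export one 2-D tensor, keeping names in KEEP_FP16 out of the int8 path."""
    if name in KEEP_FP16:
        return {"dtype": "fp16", "data": [[to_fp16(v) for v in r] for r in rows]}
    packed = [quantize_row_int8(r) for r in rows]
    return {"dtype": "int8",
            "data": [q for q, _ in packed],
            "scales": [s for _, s in packed]}


def dequantize(entry: dict) -> list:
    """Reconstruct float rows from an exported entry."""
    if entry["dtype"] == "fp16":
        return entry["data"]
    return [[q * s for q in row] for row, s in zip(entry["data"], entry["scales"])]


rows = [[0.5, -0.25, 0.125], [0.01, 0.02, -0.03]]
print(export_tensor("tok_emb.weight", rows)["dtype"])
print(export_tensor("mlp.weight", rows)["dtype"])  # hypothetical tensor name
```

The trade-off is size against fidelity: an fp16 tensor costs two bytes per element instead of one, which is presumably why the same commit shrinks the MLP hidden size to stay under the 16MB limit.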
|