summaryrefslogtreecommitdiff
path: root/scripts/slurm_train.sh
AgeCommit message (Collapse)Author
2026-02-10Add auto-resume checkpointing, S1/S2 configs, and experiment resultsYurenHao0426
- Auto-resume: find latest checkpoint in save_dir on startup - SIGUSR1 handler: save checkpoint before SLURM timeout - S1 config (constant tau=5, identity init verification) - S2 config (constant tau=2, gradient flow check) - Experiment results tracker with S0/S1 data - Speed estimates and experiment plan Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09Fix init state: add logit_bias so A≈1 at init (dense connectivity)YurenHao0426
- Add learnable logit_bias=15.0 to PredictorMLP, so σ(15/τ_init) ≈ 0.95 at init, reproducing dense connectivity instead of random A≈0.25 - Fix dtype mismatch: cast A to model dtype (bfloat16) in DAGFormerOLMo.forward - Fix YAML lr parsing: add type coercion in TrainConfig.from_yaml - Fix device mismatch: call self.to(device) in StructurePredictor.__init__ - Add python -u for unbuffered SLURM output, TOKENIZERS_PARALLELISM=false - Delete stale eval_cache.pt (built with buggy MLP input code) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09Initial implementation: DAGFormer Phase 1YurenHao0426
- olmo_graph.py: Modified OLMo2-1B forward with per-head routing via 256x256 adjacency matrix A - Proportional attribution for post-norm decomposition - All 6 GPU sanity checks pass (baseline diff = 0.000001) - predictor.py: Qwen3-Embedding encoder + MLP decoder + Gumbel-Sigmoid + cascading gate - pipeline.py: End-to-end glue (predictor -> A -> OLMo -> NLL) - trainer.py: Full training loop with DDP, gradient accumulation, eval, checkpointing - dolma.py: Streaming Dolma v1.7 with sequence packing - 43/43 unit tests pass Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>