Age  Commit message  Author
15 hours  Fix cascading gate: exempt layer 0 from disconnection check (HEAD, main)  YurenHao0426
Layer 0 has no incoming edges structurally (no prior layers), but receives the embedding as input. The cascading gate was killing its outgoing edges (hard: g=0, soft: g=0.5), causing nll_hard to be ~2x worse than baseline.
Fix: set g=1 for layer 0 nodes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
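The fix described above can be illustrated with a minimal sketch. It is not the repository's code: the function name `cascading_gate`, the sharpness parameter `k`, the sigmoid form of the soft gate, and the node ordering (layer by layer, edges only from lower to higher indices) are all assumptions, chosen to reproduce the g=0 (hard) and g=0.5 (soft) values mentioned in the message.

```python
import math

def cascading_gate(A, nodes_per_layer, hard=True, k=1.0):
    """Scale each node's outgoing edges by a gate g derived from its incoming edges.

    A[i][j] is the edge weight from node i to node j (hypothetical layout).
    Nodes are assumed to be numbered layer by layer, so indices
    0..nodes_per_layer-1 form layer 0 and edges only go from lower to higher index.
    """
    n = len(A)
    gated = [row[:] for row in A]
    for j in range(n):
        if j < nodes_per_layer:
            # Layer 0 is fed by the embedding, so it is never treated as disconnected.
            g = 1.0
        else:
            # Use already-gated rows so that disconnection cascades downstream.
            incoming = sum(gated[i][j] for i in range(j))
            if hard:
                g = 1.0 if incoming > 0 else 0.0
            else:
                g = 1.0 / (1.0 + math.exp(-k * incoming))  # 0.5 when incoming == 0
        gated[j] = [g * w for w in A[j]]
    return gated

# Three single-node layers: node 0 -> node 2 and node 1 -> node 2.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
hard_result = cascading_gate(A, nodes_per_layer=1, hard=True)
soft_result = cascading_gate(A, nodes_per_layer=1, hard=False)
```

In this toy graph node 0 keeps its edge to node 2 because of the exemption, while node 1 (which has no incoming edge) has its outgoing edge removed in hard mode and halved in soft mode. Without the exemption, node 0 would be gated the same way as node 1.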
17 hours  Fix NLL double-shift bug and head weight init  YurenHao0426
- NLL loss was shifting labels twice (olmo_labels already shifted, then code did logits[:,:-1] vs labels[:,1:]). Fixed in 9 locations: trainer, pipeline, olmo_graph, sanity_check, eval.
- Head U/V weights init with std=0.01 (was Kaiming ~5.7 std) so UV^T≈0 at init, ensuring Z≈logit_bias=15 and A≈0.953.
- Updated SVD rank test to subtract logit_bias before checking.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
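The double-shift bug in the first bullet can be shown with a toy alignment check. This is a sketch under assumptions, not the project's loss code: it assumes the pre-shifted label convention where `labels[t]` is the token that follows `tokens[t]`, and it uses -100 as the ignored-position marker (the default `ignore_index` of PyTorch's cross-entropy; whether the repository uses it is not stated). The helper names are hypothetical.

```python
IGNORE = -100  # assumed padding marker for positions without a target

tokens = [10, 11, 12, 13, 14]
# Labels are ALREADY shifted: labels[t] is the token that follows tokens[t].
labels = tokens[1:] + [IGNORE]

def aligned_pairs(inputs, labels):
    """Correct pairing when labels are pre-shifted: position t is scored against labels[t]."""
    return [(x, y) for x, y in zip(inputs, labels) if y != IGNORE]

def double_shifted_pairs(inputs, labels):
    """Buggy pairing: the logits[:, :-1] vs labels[:, 1:] shift applied on top of pre-shifted labels."""
    return [(x, y) for x, y in zip(inputs[:-1], labels[1:]) if y != IGNORE]

correct = aligned_pairs(tokens, labels)
buggy = double_shifted_pairs(tokens, labels)
```

Here `correct` pairs each input token with its immediate successor, while `buggy` pairs each input token with the token two steps ahead, so the loss would be computed against the wrong targets.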
18 hours  Fix init state: add logit_bias so A≈1 at init (dense connectivity)  YurenHao0426
- Add learnable logit_bias=15.0 to PredictorMLP, so σ(15/τ_init) ≈ 0.95 at init, reproducing dense connectivity instead of random A≈0.25
- Fix dtype mismatch: cast A to model dtype (bfloat16) in DAGFormerOLMo.forward
- Fix YAML lr parsing: add type coercion in TrainConfig.from_yaml
- Fix device mismatch: call self.to(device) in StructurePredictor.__init__
- Add python -u for unbuffered SLURM output, TOKENIZERS_PARALLELISM=false
- Delete stale eval_cache.pt (built with buggy MLP input code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
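The σ(15/τ_init) ≈ 0.95 figure in the first bullet can be checked with a few lines of arithmetic. The value of τ_init is not given in the log; τ_init = 5 is an assumption, chosen because it matches both the ≈0.95 here and the A≈0.953 quoted in the later commit.

```python
import math

def sigmoid(x):
    """Logistic function σ(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

logit_bias = 15.0
tau_init = 5.0  # ASSUMPTION: not stated in the log, inferred from A≈0.953

a_init = sigmoid(logit_bias / tau_init)  # edge probability at init when the MLP output is ~0
a_no_bias = sigmoid(0.0 / tau_init)      # same computation without any bias term
print(round(a_init, 4), a_no_bias)
```

With these assumed values the initial edge probability is about 0.9526, compared with 0.5 for a zero logit without the bias.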
19 hours  Initial implementation: DAGFormer Phase 1  YurenHao0426
- olmo_graph.py: Modified OLMo2-1B forward with per-head routing via 256x256 adjacency matrix A
- Proportional attribution for post-norm decomposition
- All 6 GPU sanity checks pass (baseline diff = 0.000001)
- predictor.py: Qwen3-Embedding encoder + MLP decoder + Gumbel-Sigmoid + cascading gate
- pipeline.py: End-to-end glue (predictor -> A -> OLMo -> NLL)
- trainer.py: Full training loop with DDP, gradient accumulation, eval, checkpointing
- dolma.py: Streaming Dolma v1.7 with sequence packing
- 43/43 unit tests pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>