<feed xmlns='http://www.w3.org/2005/Atom'>
<title>dagformer.git/tests, branch main</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/dagformer.git/'/>
<entry>
<title>Fix cascading gate: exempt layer 0 from disconnection check</title>
<updated>2026-02-09T20:40:31+00:00</updated>
<author>
<name>YurenHao0426</name>
<email>blackhao0426@gmail.com</email>
</author>
<published>2026-02-09T20:40:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/dagformer.git/commit/?id=80579d6cc254d337a23e71404ae7ecab1849d1e5'/>
<id>80579d6cc254d337a23e71404ae7ecab1849d1e5</id>
<content type='text'>
Layer 0 has no incoming edges structurally (no prior layers), but
receives the embedding as input. The cascading gate was killing its
outgoing edges (hard: g=0, soft: g=0.5), causing nll_hard to be
~2x worse than baseline. Fix: set g=1 for layer 0 nodes.
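
A minimal sketch of the fix (hard-gate variant). The function name, adjacency layout, and layer indexing below are illustrative assumptions, not the repository's actual API:

```python
import numpy as np

def cascading_gate(A, layer_of_node):
    """Illustrative cascading gate, hard variant.

    A[i, j] is the (assumed) edge weight from node j into node i.
    A node with no active incoming edges is treated as disconnected,
    so its outgoing edges are gated off (g = 0). Layer-0 nodes have
    no prior layers to receive edges from, but they still get the
    embedding as input, so they must be exempted with g = 1.
    """
    incoming = A.sum(axis=1)                  # incoming edge mass per node
    g = (incoming > 0).astype(A.dtype)        # hard gate: 0 if disconnected
    g[layer_of_node == 0] = 1.0               # the fix: never gate off layer 0
    return g
```

Without the last line, every layer-0 node reads as disconnected and its outgoing edges are killed, which is the nll_hard regression described above.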

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Layer 0 has no incoming edges structurally (no prior layers), but
receives the embedding as input. The cascading gate was killing its
outgoing edges (hard: g=0, soft: g=0.5), causing nll_hard to be
~2x worse than baseline. Fix: set g=1 for layer 0 nodes.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Fix NLL double-shift bug and head weight init</title>
<updated>2026-02-09T18:28:55+00:00</updated>
<author>
<name>YurenHao0426</name>
<email>blackhao0426@gmail.com</email>
</author>
<published>2026-02-09T18:28:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/dagformer.git/commit/?id=ef678d2e1ba70b1a9dadb78c73ed372f986aea13'/>
<id>ef678d2e1ba70b1a9dadb78c73ed372f986aea13</id>
<content type='text'>
- NLL loss was shifting labels twice (olmo_labels already shifted,
  then code did logits[:,:-1] vs labels[:,1:]). Fixed in 9 locations:
  trainer, pipeline, olmo_graph, sanity_check, eval.
- Head U/V weights init with std=0.01 (was Kaiming ~5.7 std) so
  UV^T≈0 at init, ensuring Z≈logit_bias=15 and A≈0.953.
- Updated SVD rank test to subtract logit_bias before checking.
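
A sketch of the first bullet's corrected loss, assuming (per the commit) that labels arrive already shifted by one position, i.e. labels[t] is the target for logits[t]; names and shapes are illustrative:

```python
import numpy as np

def nll_pre_shifted(logits, labels, ignore_index=-100):
    """NLL for labels that are ALREADY shifted one position.

    The bug was applying a second shift (logits[:, :-1] vs labels[:, 1:])
    on top of pre-shifted labels, so each prediction was scored against
    the token two steps ahead instead of the next token.
    """
    mask = (labels != ignore_index)
    z = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    safe = np.maximum(labels, 0)                       # avoid indexing with -100
    picked = np.take_along_axis(logp, safe[..., None], axis=-1)[..., 0]
    return -(picked * mask).sum() / mask.sum()         # mean over unmasked tokens
```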

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
- NLL loss was shifting labels twice (olmo_labels already shifted,
  then code did logits[:,:-1] vs labels[:,1:]). Fixed in 9 locations:
  trainer, pipeline, olmo_graph, sanity_check, eval.
- Head U/V weights init with std=0.01 (was Kaiming ~5.7 std) so
  UV^T≈0 at init, ensuring Z≈logit_bias=15 and A≈0.953.
- Updated SVD rank test to subtract logit_bias before checking.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Initial implementation: DAGFormer Phase 1</title>
<updated>2026-02-09T17:00:39+00:00</updated>
<author>
<name>YurenHao0426</name>
<email>blackhao0426@gmail.com</email>
</author>
<published>2026-02-09T17:00:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.blackhao.com/dagformer.git/commit/?id=13ddc8dc583d8b1355909970cb8c27f85b7d3c8b'/>
<id>13ddc8dc583d8b1355909970cb8c27f85b7d3c8b</id>
<content type='text'>
- olmo_graph.py: Modified OLMo2-1B forward with per-head routing via 256x256 adjacency matrix A
  - Proportional attribution for post-norm decomposition
  - All 6 GPU sanity checks pass (baseline diff = 0.000001)
- predictor.py: Qwen3-Embedding encoder + MLP decoder + Gumbel-Sigmoid + cascading gate
- pipeline.py: End-to-end glue (predictor -&gt; A -&gt; OLMo -&gt; NLL)
- trainer.py: Full training loop with DDP, gradient accumulation, eval, checkpointing
- dolma.py: Streaming Dolma v1.7 with sequence packing
- 43/43 unit tests pass
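
The pipeline glue above can be sketched at a very high level. Everything below (the 256-node count from the commit, the Gumbel-Sigmoid formulation, the stub components) is an assumption for illustration, not the repository's implementation:

```python
import numpy as np

N_NODES = 256  # assumed: one node per (layer, head), giving a 256x256 adjacency

def gumbel_sigmoid(logits, rng, tau=1.0):
    """One common Bernoulli relaxation: sigmoid of logits plus logistic noise."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)                   # logistic noise sample
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

def pipeline_step(prompt_emb, edge_head, run_olmo, labels, rng):
    """End-to-end glue: predictor output, then A, then OLMo forward, then NLL."""
    edge_logits = prompt_emb @ edge_head               # MLP decoder stand-in
    A = gumbel_sigmoid(edge_logits, rng).reshape(N_NODES, N_NODES)
    logits, nll = run_olmo(A, labels)                  # routed forward + loss stub
    return A, nll
```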

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
- olmo_graph.py: Modified OLMo2-1B forward with per-head routing via 256x256 adjacency matrix A
  - Proportional attribution for post-norm decomposition
  - All 6 GPU sanity checks pass (baseline diff = 0.000001)
- predictor.py: Qwen3-Embedding encoder + MLP decoder + Gumbel-Sigmoid + cascading gate
- pipeline.py: End-to-end glue (predictor -&gt; A -&gt; OLMo -&gt; NLL)
- trainer.py: Full training loop with DDP, gradient accumulation, eval, checkpointing
- dolma.py: Streaming Dolma v1.7 with sequence packing
- 43/43 unit tests pass

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
