Device: cuda:0

=== BP frozen-blocks baseline (4 random-init transformer blocks, frozen) ===
BP-frozen-blocks: 16266/809354 params trainable
BP-frozen ep 1: test_acc=0.3762
BP-frozen ep 5: test_acc=0.4724
BP-frozen ep 10: test_acc=0.4961
BP-frozen ep 15: test_acc=0.5189
BP-frozen ep 20: test_acc=0.5252
BP-frozen ep 25: test_acc=0.5366
BP-frozen ep 30: test_acc=0.5402
FINAL BP-frozen-blocks acc: 0.5402

=== DFA frozen-blocks baseline ===
DFA-frozen-blocks: 16266/809354 params trainable
DFA-frozen ep 1: test_acc=0.2529
DFA-frozen ep 5: test_acc=0.2477
DFA-frozen ep 10: test_acc=0.2530
DFA-frozen ep 15: test_acc=0.2566
DFA-frozen ep 20: test_acc=0.2530
DFA-frozen ep 25: test_acc=0.2545
DFA-frozen ep 30: test_acc=0.2554
FINAL DFA-frozen-blocks acc: 0.2554

=== Summary ===
BP-frozen-blocks: 0.5402 (chance=0.10)
DFA-frozen-blocks: 0.2554
Compare to ViT-Mini 4-block trainable (3-seed avg): BP=0.792, DFA=0.237
Compare to ViT-Mini 0-block (shallow baseline): BP=0.10, DFA=0.10

Interpretation:
  If DFA-frozen-blocks ≈ 0.237: blocks are passengers, DFA is just learning patch_embed+head
  If DFA-frozen-blocks << 0.237: trainable blocks ARE doing learned work
  If DFA-frozen-blocks ~ 0.10: untrained blocks add no useful mixing (less informative)
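
The three interpretation cases above can be written as a small decision rule. This is only a sketch: the function name `interpret_dfa_frozen` and the tolerance `tol` are hypothetical, since the log never says how close "≈" or how far "<<" has to be. The reference values 0.237 and 0.10 are taken from the summary lines.

```python
def interpret_dfa_frozen(dfa_frozen, dfa_trainable=0.237, chance=0.10, tol=0.03):
    """Map a DFA-frozen-blocks accuracy onto the three cases in the log.

    tol is an assumed tolerance; the log does not define one.
    """
    # Case 3 in the log: accuracy is indistinguishable from chance.
    if abs(dfa_frozen - chance) <= tol:
        return "near chance: untrained blocks add no useful mixing"
    # Case 1 in the log: frozen blocks match the trainable-block DFA run.
    if abs(dfa_frozen - dfa_trainable) <= tol:
        return "passengers: DFA is just learning patch_embed+head"
    # Case 2 in the log: frozen blocks fall clearly below the trainable run.
    if dfa_frozen < dfa_trainable - tol:
        return "trainable blocks are doing learned work"
    # Not one of the three listed cases: frozen beats trainable DFA.
    return "above trainable DFA: not covered by the listed cases"


# FINAL DFA-frozen-blocks acc from the log.
print(interpret_dfa_frozen(0.2554))
```

With the assumed tolerance of 0.03, the logged value 0.2554 lands in the "passengers" case, because it is within 0.03 of 0.237 and well above chance. A tighter tolerance would instead push it into the uncovered "above trainable DFA" branch, so the conclusion depends on that choice.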