=== full + component toggles (ms/step, B=24, C512) ===
/home/yurenh2/miniconda3/lib/python3.13/site-packages/torch/autograd/graph.py:865: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:330.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
FULL ep_step: 7266
  -jacreg: 7242
  -resreg: 7312
  -t1max(no refine): 5886
  t2sel=80: 7384
  t2sel=40: 4485
  plain nudge holo=0 T2=20: 3179
  free relax T1=150 alone: 740
  free relax T1=300 alone: 1480
=== batch sweep (full) ===
  B=8: 2353 ms  (294.1/sample)
  B=24: 7405 ms  (308.5/sample)
  B=48: 14496 ms  (302.0/sample)
=== compile free relax (ms/step) ===
  free relax T1=150 COMPILED: 507
=== bf16 full ===
  full bf16: ERR RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
DONE