=== Reward Model Comparison Test ===
Local: Qwen2.5-1.5B-Instruct
API: GPT-5-nano

================================================================================
Reward Model Comparison: Qwen2.5-1.5B vs GPT-5-nano
================================================================================

Loading models/qwen2.5-1.5b-instruct...
Model loaded.

Running 12 test cases...

--- Test 1/12: neg_constraint_restate - format preference ---
Expected: neg_constraint_restate
Local (Qwen): neg_constraint_restate (conf=0.95) [2.42s] ✓
GPT-5-nano: neutral (conf=0.00) [4.97s] ✗
Agreement: NO

--- Test 2/12: neg_constraint_restate - step by step ---
Expected: neg_constraint_restate
Local (Qwen): neutral (conf=0.00) [1.04s] ✗
GPT-5-nano: neutral (conf=0.00) [4.43s] ✗
Agreement: Yes

--- Test 3/12: neg_correction - wrong answer ---
Expected: neg_correction
Local (Qwen): neutral (conf=0.00) [0.93s] ✗
GPT-5-nano: neutral (conf=0.00) [3.53s] ✗
Agreement: Yes

--- Test 4/12: neg_confusion - unclear explanation ---
Expected: neg_confusion
Local (Qwen): neutral (conf=0.00) [0.89s] ✗
GPT-5-nano: neutral (conf=0.00) [2.52s] ✗
Agreement: Yes

--- Test 5/12: pos_praise - explicit thanks ---
Expected: pos_praise
Local (Qwen): neutral (conf=0.00) [0.88s] ✗
GPT-5-nano: pos_praise (conf=0.92) [3.57s] ✓
Agreement: NO

--- Test 6/12: pos_praise - great explanation ---
Expected: pos_praise
Local (Qwen): neutral (conf=0.00) [0.92s] ✗
GPT-5-nano: pos_praise (conf=0.85) [2.19s] ✓
Agreement: NO

--- Test 7/12: pos_progress - follow-up question ---
Expected: pos_progress
Local (Qwen): neutral (conf=0.00) [0.95s] ✗
GPT-5-nano: neutral (conf=0.00) [3.76s] ✗
Agreement: Yes

--- Test 8/12: pos_progress - extension ---
Expected: pos_progress
Local (Qwen): neutral (conf=0.00) [1.20s] ✗
GPT-5-nano: neutral (conf=0.00) [3.37s] ✗
Agreement: Yes

--- Test 9/12: neutral - minimal response ---
Expected: neutral
Local (Qwen): neutral (conf=0.00) [0.78s] ✓
GPT-5-nano: neutral (conf=0.00) [4.21s] ✓
Agreement: Yes

--- Test 10/12: topic_shift - new topic ---
Expected: topic_shift
Local (Qwen): neutral (conf=0.00) [1.02s] ✗
GPT-5-nano: neutral (conf=0.00) [3.86s] ✗
Agreement: Yes

--- Test 11/12: neg_constraint_restate - language preference ---
Expected: neg_constraint_restate
Local (Qwen): neutral (conf=0.00) [0.94s] ✗
GPT-5-nano: neutral (conf=0.00) [2.31s] ✗
Agreement: Yes

--- Test 12/12: neg_correction - incomplete answer ---
Expected: neg_correction
Local (Qwen): neutral (conf=0.00) [0.90s] ✗
GPT-5-nano: neutral (conf=0.00) [4.49s] ✗
Agreement: Yes

================================================================================
SUMMARY
================================================================================

Local (Qwen2.5-1.5B) Accuracy: 16.7% (2/12)
GPT-5-nano Accuracy: 25.0% (3/12)
Agreement Rate: 75.0%

Local Avg Time: 1.07s
GPT Avg Time: 3.60s
Speedup: 3.4x faster (local)

Disagreements (3):
  - neg_constraint_restate - format preference: Local=neg_constraint_restate, GPT=neutral, Expected=neg_constraint_restate
  - pos_praise - explicit thanks: Local=neutral, GPT=pos_praise, Expected=pos_praise
  - pos_praise - great explanation: Local=neutral, GPT=pos_praise, Expected=pos_praise

Local Model Errors (10):
  - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
  - neg_correction - wrong answer: Got neutral, Expected neg_correction
  - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
  - pos_praise - explicit thanks: Got neutral, Expected pos_praise
  - pos_praise - great explanation: Got neutral, Expected pos_praise
  - pos_progress - follow-up question: Got neutral, Expected pos_progress
  - pos_progress - extension: Got neutral, Expected pos_progress
  - topic_shift - new topic: Got neutral, Expected topic_shift
  - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
  - neg_correction - incomplete answer: Got neutral, Expected neg_correction

GPT Model Errors (9):
  - neg_constraint_restate - format preference: Got neutral, Expected neg_constraint_restate
  - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
  - neg_correction - wrong answer: Got neutral, Expected neg_correction
  - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
  - pos_progress - follow-up question: Got neutral, Expected pos_progress
  - pos_progress - extension: Got neutral, Expected pos_progress
  - topic_shift - new topic: Got neutral, Expected topic_shift
  - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
  - neg_correction - incomplete answer: Got neutral, Expected neg_correction

================================================================================

=== Test Complete ===
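
The SUMMARY figures can be cross-checked against the twelve per-test records in the log. The sketch below is one possible way to do that, not the script that produced this output: the `Result` dataclass, the `RESULTS` list, and the `summarize` function are hypothetical names introduced here, and the values are transcribed by hand from the per-test lines.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # One per-test record, transcribed from the log (hypothetical structure).
    expected: str
    local: str      # label predicted by the local Qwen model
    gpt: str        # label predicted by GPT-5-nano
    local_s: float  # local latency in seconds
    gpt_s: float    # API latency in seconds


RESULTS = [
    Result("neg_constraint_restate", "neg_constraint_restate", "neutral", 2.42, 4.97),
    Result("neg_constraint_restate", "neutral", "neutral", 1.04, 4.43),
    Result("neg_correction", "neutral", "neutral", 0.93, 3.53),
    Result("neg_confusion", "neutral", "neutral", 0.89, 2.52),
    Result("pos_praise", "neutral", "pos_praise", 0.88, 3.57),
    Result("pos_praise", "neutral", "pos_praise", 0.92, 2.19),
    Result("pos_progress", "neutral", "neutral", 0.95, 3.76),
    Result("pos_progress", "neutral", "neutral", 1.20, 3.37),
    Result("neutral", "neutral", "neutral", 0.78, 4.21),
    Result("topic_shift", "neutral", "neutral", 1.02, 3.86),
    Result("neg_constraint_restate", "neutral", "neutral", 0.94, 2.31),
    Result("neg_correction", "neutral", "neutral", 0.90, 4.49),
]


def summarize(results):
    """Recompute accuracy, agreement, and timing from per-test records."""
    n = len(results)
    local_avg = sum(r.local_s for r in results) / n
    gpt_avg = sum(r.gpt_s for r in results) / n
    return {
        "n": n,
        "local_correct": sum(r.local == r.expected for r in results),
        "gpt_correct": sum(r.gpt == r.expected for r in results),
        "agree": sum(r.local == r.gpt for r in results),
        "local_avg_s": local_avg,
        "gpt_avg_s": gpt_avg,
        # Ratio of average API latency to average local latency.
        "speedup": gpt_avg / local_avg,
    }


if __name__ == "__main__":
    s = summarize(RESULTS)
    print(f"Local accuracy: {s['local_correct'] / s['n']:.1%} ({s['local_correct']}/{s['n']})")
    print(f"GPT accuracy:   {s['gpt_correct'] / s['n']:.1%} ({s['gpt_correct']}/{s['n']})")
    print(f"Agreement rate: {s['agree'] / s['n']:.1%}")
    print(f"Local avg time: {s['local_avg_s']:.2f}s")
    print(f"GPT avg time:   {s['gpt_avg_s']:.2f}s")
    print(f"Speedup:        {s['speedup']:.1f}x faster (local)")
```

Agreement here means only that both models returned the same label, whether or not that label matches the expected one, which is how the per-test "Agreement" lines in the log read.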