=== Reward Model Comparison Test ===
Local: Llama-3.1-8B-Instruct
API: GPT-4o-mini

================================================================================
Reward Model Comparison: Llama-3.1-8B vs GPT-4o-mini
================================================================================

Loading models/llama-3.1-8b-instruct...
Model loaded.

Running 12 test cases...

--- Test 1/12: neg_constraint_restate - format preference ---
Expected: neg_constraint_restate
Local (Llama): neg_correction (conf=0.80) [1.48s] ✗
GPT-4o-mini: neg_correction (conf=0.90) [1.72s] ✗
Agreement: Yes

--- Test 2/12: neg_constraint_restate - step by step ---
Expected: neg_constraint_restate
Local (Llama): neg_constraint_restate (conf=0.90) [1.16s] ✓
GPT-4o-mini: neg_constraint_restate (conf=0.90) [0.95s] ✓
Agreement: Yes

--- Test 3/12: neg_correction - wrong answer ---
Expected: neg_correction
Local (Llama): neg_correction (conf=0.90) [1.03s] ✓
GPT-4o-mini: neg_correction (conf=0.90) [1.01s] ✓
Agreement: Yes

--- Test 4/12: neg_confusion - unclear explanation ---
Expected: neg_confusion
Local (Llama): neg_confusion (conf=0.80) [1.20s] ✓
GPT-4o-mini: neg_confusion (conf=0.90) [1.14s] ✓
Agreement: Yes

--- Test 5/12: pos_praise - explicit thanks ---
Expected: pos_praise
Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
GPT-4o-mini: pos_praise (conf=0.95) [1.32s] ✓
Agreement: Yes

--- Test 6/12: pos_praise - great explanation ---
Expected: pos_praise
Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
GPT-4o-mini: pos_praise (conf=0.95) [1.02s] ✓
Agreement: Yes

--- Test 7/12: pos_progress - follow-up question ---
Expected: pos_progress
Local (Llama): pos_progress (conf=0.90) [1.35s] ✓
GPT-4o-mini: pos_progress (conf=0.90) [1.15s] ✓
Agreement: Yes

--- Test 8/12: pos_progress - extension ---
Expected: pos_progress
Local (Llama): pos_progress (conf=0.90) [1.33s] ✓
GPT-4o-mini: pos_progress (conf=0.90) [1.25s] ✓
Agreement: Yes

--- Test 9/12: neutral - minimal response ---
Expected: neutral
Local (Llama): neutral (conf=0.80) [1.19s] ✓
GPT-4o-mini: neutral (conf=0.90) [1.24s] ✓
Agreement: Yes

--- Test 10/12: topic_shift - new topic ---
Expected: topic_shift
Local (Llama): topic_shift (conf=0.90) [1.21s] ✓
GPT-4o-mini: topic_shift (conf=0.90) [1.61s] ✓
Agreement: Yes

--- Test 11/12: neg_constraint_restate - language preference ---
Expected: neg_constraint_restate
Local (Llama): neg_constraint_restate (conf=0.80) [1.38s] ✓
GPT-4o-mini: neg_constraint_restate (conf=0.90) [2.55s] ✓
Agreement: Yes

--- Test 12/12: neg_correction - incomplete answer ---
Expected: neg_correction
Local (Llama): neg_correction (conf=0.80) [1.00s] ✓
GPT-4o-mini: neg_correction (conf=0.90) [2.35s] ✓
Agreement: Yes

================================================================================
SUMMARY
================================================================================
Local (Llama-3.1-8B) Accuracy: 91.7% (11/12)
GPT-4o-mini Accuracy: 91.7% (11/12)
Agreement Rate: 100.0%
Local Avg Time: 1.19s
GPT Avg Time: 1.44s
Speedup: 1.2x faster (local)

Local Model Errors (1):
  - neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate

GPT Model Errors (1):
  - neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate

================================================================================

=== Test Complete ===
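
The SUMMARY figures can be re-derived from the per-test lines. The script that produced this log is not shown, so the following is only a minimal sketch: the names `CaseResult` and `summarize` are hypothetical, and the formulas (accuracy as correct/total, agreement as matching labels/total, speedup as GPT average time divided by local average time) are assumptions that happen to be consistent with the printed numbers. The data rows are copied from the twelve test entries in the log.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test case from the log (hypothetical container, not the script's own type)."""
    expected: str
    local_label: str
    gpt_label: str
    local_secs: float
    gpt_secs: float


# (expected, local label, GPT label, local seconds, GPT seconds), copied from the log.
ROWS = [
    ("neg_constraint_restate", "neg_correction", "neg_correction", 1.48, 1.72),
    ("neg_constraint_restate", "neg_constraint_restate", "neg_constraint_restate", 1.16, 0.95),
    ("neg_correction", "neg_correction", "neg_correction", 1.03, 1.01),
    ("neg_confusion", "neg_confusion", "neg_confusion", 1.20, 1.14),
    ("pos_praise", "pos_praise", "pos_praise", 0.97, 1.32),
    ("pos_praise", "pos_praise", "pos_praise", 0.97, 1.02),
    ("pos_progress", "pos_progress", "pos_progress", 1.35, 1.15),
    ("pos_progress", "pos_progress", "pos_progress", 1.33, 1.25),
    ("neutral", "neutral", "neutral", 1.19, 1.24),
    ("topic_shift", "topic_shift", "topic_shift", 1.21, 1.61),
    ("neg_constraint_restate", "neg_constraint_restate", "neg_constraint_restate", 1.38, 2.55),
    ("neg_correction", "neg_correction", "neg_correction", 1.00, 2.35),
]
RESULTS = [CaseResult(*row) for row in ROWS]


def summarize(results):
    """Compute summary metrics under the assumed definitions described above."""
    n = len(results)
    local_avg = sum(r.local_secs for r in results) / n
    gpt_avg = sum(r.gpt_secs for r in results) / n
    return {
        "local_accuracy": sum(r.local_label == r.expected for r in results) / n,
        "gpt_accuracy": sum(r.gpt_label == r.expected for r in results) / n,
        "agreement_rate": sum(r.local_label == r.gpt_label for r in results) / n,
        "local_avg_secs": local_avg,
        "gpt_avg_secs": gpt_avg,
        # Assumed definition: how many times faster the local model is on average.
        "speedup": gpt_avg / local_avg,
    }


summary = summarize(RESULTS)
print(f"Local accuracy: {summary['local_accuracy']:.1%}")
print(f"GPT accuracy:   {summary['gpt_accuracy']:.1%}")
print(f"Agreement rate: {summary['agreement_rate']:.1%}")
print(f"Local avg time: {summary['local_avg_secs']:.2f}s")
print(f"GPT avg time:   {summary['gpt_avg_secs']:.2f}s")
print(f"Speedup:        {summary['speedup']:.1f}x")
```

Note that agreement here measures only whether the two models returned the same label, not whether that label was correct, which is why the rate is 100.0% even though both models missed Test 1.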