1 file changed, 126 insertions, 0 deletions
diff --git a/scripts/test_reward_cmp_15667076.out b/scripts/test_reward_cmp_15667076.out
new file mode 100644
index 0000000..67693b1
--- /dev/null
+++ b/scripts/test_reward_cmp_15667076.out
@@ -0,0 +1,126 @@
+=== Reward Model Comparison Test ===
+Local: Qwen2.5-1.5B-Instruct
+API: GPT-5-nano
+
+================================================================================
+Reward Model Comparison: Qwen2.5-1.5B vs GPT-5-nano
+================================================================================
+
+Loading models/qwen2.5-1.5b-instruct...
+Model loaded.
+Running 12 test cases...
+
+--- Test 1/12: neg_constraint_restate - format preference ---
+Expected: neg_constraint_restate
+  Local (Qwen): neg_constraint_restate (conf=0.95) [2.42s] ✓
+  GPT-5-nano:   neutral (conf=0.00) [4.97s] ✗
+  Agreement: NO
+
+--- Test 2/12: neg_constraint_restate - step by step ---
+Expected: neg_constraint_restate
+  Local (Qwen): neutral (conf=0.00) [1.04s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [4.43s] ✗
+  Agreement: Yes
+
+--- Test 3/12: neg_correction - wrong answer ---
+Expected: neg_correction
+  Local (Qwen): neutral (conf=0.00) [0.93s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [3.53s] ✗
+  Agreement: Yes
+
+--- Test 4/12: neg_confusion - unclear explanation ---
+Expected: neg_confusion
+  Local (Qwen): neutral (conf=0.00) [0.89s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [2.52s] ✗
+  Agreement: Yes
+
+--- Test 5/12: pos_praise - explicit thanks ---
+Expected: pos_praise
+  Local (Qwen): neutral (conf=0.00) [0.88s] ✗
+  GPT-5-nano:   pos_praise (conf=0.92) [3.57s] ✓
+  Agreement: NO
+
+--- Test 6/12: pos_praise - great explanation ---
+Expected: pos_praise
+  Local (Qwen): neutral (conf=0.00) [0.92s] ✗
+  GPT-5-nano:   pos_praise (conf=0.85) [2.19s] ✓
+  Agreement: NO
+
+--- Test 7/12: pos_progress - follow-up question ---
+Expected: pos_progress
+  Local (Qwen): neutral (conf=0.00) [0.95s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [3.76s] ✗
+  Agreement: Yes
+
+--- Test 8/12: pos_progress - extension ---
+Expected: pos_progress
+  Local (Qwen): neutral (conf=0.00) [1.20s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [3.37s] ✗
+  Agreement: Yes
+
+--- Test 9/12: neutral - minimal response ---
+Expected: neutral
+  Local (Qwen): neutral (conf=0.00) [0.78s] ✓
+  GPT-5-nano:   neutral (conf=0.00) [4.21s] ✓
+  Agreement: Yes
+
+--- Test 10/12: topic_shift - new topic ---
+Expected: topic_shift
+  Local (Qwen): neutral (conf=0.00) [1.02s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [3.86s] ✗
+  Agreement: Yes
+
+--- Test 11/12: neg_constraint_restate - language preference ---
+Expected: neg_constraint_restate
+  Local (Qwen): neutral (conf=0.00) [0.94s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [2.31s] ✗
+  Agreement: Yes
+
+--- Test 12/12: neg_correction - incomplete answer ---
+Expected: neg_correction
+  Local (Qwen): neutral (conf=0.00) [0.90s] ✗
+  GPT-5-nano:   neutral (conf=0.00) [4.49s] ✗
+  Agreement: Yes
+
+================================================================================
+SUMMARY
+================================================================================
+Local (Qwen2.5-1.5B) Accuracy: 16.7% (2/12)
+GPT-5-nano Accuracy:           25.0% (3/12)
+Agreement Rate:                75.0%
+
+Local Avg Time: 1.07s
+GPT Avg Time:   3.60s
+Speedup:        3.4x faster (local)
+
+Disagreements (3):
+  - neg_constraint_restate - format preference: Local=neg_constraint_restate, GPT=neutral, Expected=neg_constraint_restate
+  - pos_praise - explicit thanks: Local=neutral, GPT=pos_praise, Expected=pos_praise
+  - pos_praise - great explanation: Local=neutral, GPT=pos_praise, Expected=pos_praise
+
+Local Model Errors (10):
+  - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
+  - neg_correction - wrong answer: Got neutral, Expected neg_correction
+  - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
+  - pos_praise - explicit thanks: Got neutral, Expected pos_praise
+  - pos_praise - great explanation: Got neutral, Expected pos_praise
+  - pos_progress - follow-up question: Got neutral, Expected pos_progress
+  - pos_progress - extension: Got neutral, Expected pos_progress
+  - topic_shift - new topic: Got neutral, Expected topic_shift
+  - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
+  - neg_correction - incomplete answer: Got neutral, Expected neg_correction
+
+GPT Model Errors (9):
+  - neg_constraint_restate - format preference: Got neutral, Expected neg_constraint_restate
+  - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
+  - neg_correction - wrong answer: Got neutral, Expected neg_correction
+  - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
+  - pos_progress - follow-up question: Got neutral, Expected pos_progress
+  - pos_progress - extension: Got neutral, Expected pos_progress
+  - topic_shift - new topic: Got neutral, Expected topic_shift
+  - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
+  - neg_correction - incomplete answer: Got neutral, Expected neg_correction
+
+================================================================================
+
+=== Test Complete ===
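
Review note: the SUMMARY figures in this log can be cross-checked against the per-test lines. The following is a minimal sketch, assuming the tuples below are transcribed by hand from the log above. `RESULTS` and `summarize` are hypothetical names for illustration and are not taken from the test script, whose actual implementation is not shown in this diff.

```python
# Hypothetical cross-check, not part of the test script.
# Each tuple: (expected, local_label, gpt_label, local_secs, gpt_secs),
# transcribed from tests 1 through 12 of the log.
RESULTS = [
    ("neg_constraint_restate", "neg_constraint_restate", "neutral", 2.42, 4.97),
    ("neg_constraint_restate", "neutral", "neutral", 1.04, 4.43),
    ("neg_correction", "neutral", "neutral", 0.93, 3.53),
    ("neg_confusion", "neutral", "neutral", 0.89, 2.52),
    ("pos_praise", "neutral", "pos_praise", 0.88, 3.57),
    ("pos_praise", "neutral", "pos_praise", 0.92, 2.19),
    ("pos_progress", "neutral", "neutral", 0.95, 3.76),
    ("pos_progress", "neutral", "neutral", 1.20, 3.37),
    ("neutral", "neutral", "neutral", 0.78, 4.21),
    ("topic_shift", "neutral", "neutral", 1.02, 3.86),
    ("neg_constraint_restate", "neutral", "neutral", 0.94, 2.31),
    ("neg_correction", "neutral", "neutral", 0.90, 4.49),
]


def summarize(results):
    """Recompute accuracy, agreement, and timing statistics."""
    n = len(results)
    local_correct = sum(exp == loc for exp, loc, _, _, _ in results)
    gpt_correct = sum(exp == gpt for exp, _, gpt, _, _ in results)
    agree = sum(loc == gpt for _, loc, gpt, _, _ in results)
    local_avg = sum(r[3] for r in results) / n
    gpt_avg = sum(r[4] for r in results) / n
    return {
        "n": n,
        "local_correct": local_correct,
        "gpt_correct": gpt_correct,
        "agree": agree,
        "local_avg": local_avg,
        "gpt_avg": gpt_avg,
        # Ratio of mean latencies: how many times faster the local model is.
        "speedup": gpt_avg / local_avg,
    }


summary = summarize(RESULTS)
print(f"Local accuracy: {summary['local_correct'] / summary['n']:.1%}")
print(f"GPT accuracy:   {summary['gpt_correct'] / summary['n']:.1%}")
print(f"Agreement rate: {summary['agree'] / summary['n']:.1%}")
print(f"Speedup:        {summary['speedup']:.1f}x")
```

On the values transcribed here, the recomputed counts are 2/12 and 3/12 correct with 9/12 agreement, and the mean latencies round to 1.07s and 3.60s, so the logged summary is internally consistent with the per-test lines.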