1 file changed, 104 insertions, 0 deletions
diff --git a/scripts/test_reward_cmp_15667126.out b/scripts/test_reward_cmp_15667126.out
new file mode 100644
index 0000000..1e10566
--- /dev/null
+++ b/scripts/test_reward_cmp_15667126.out
@@ -0,0 +1,104 @@
+=== Reward Model Comparison Test ===
+Local: Llama-3.1-8B-Instruct
+API: GPT-4o-mini
+
+================================================================================
+Reward Model Comparison: Llama-3.1-8B vs GPT-4o-mini
+================================================================================
+
+Loading models/llama-3.1-8b-instruct...
+Model loaded.
+Running 12 test cases...
+
+--- Test 1/12: neg_constraint_restate - format preference ---
+Expected: neg_constraint_restate
+  Local (Llama): neg_correction (conf=0.80) [1.48s] ✗
+  GPT-4o-mini:  neg_correction (conf=0.90) [1.72s] ✗
+  Agreement: Yes
+
+--- Test 2/12: neg_constraint_restate - step by step ---
+Expected: neg_constraint_restate
+  Local (Llama): neg_constraint_restate (conf=0.90) [1.16s] ✓
+  GPT-4o-mini:  neg_constraint_restate (conf=0.90) [0.95s] ✓
+  Agreement: Yes
+
+--- Test 3/12: neg_correction - wrong answer ---
+Expected: neg_correction
+  Local (Llama): neg_correction (conf=0.90) [1.03s] ✓
+  GPT-4o-mini:  neg_correction (conf=0.90) [1.01s] ✓
+  Agreement: Yes
+
+--- Test 4/12: neg_confusion - unclear explanation ---
+Expected: neg_confusion
+  Local (Llama): neg_confusion (conf=0.80) [1.20s] ✓
+  GPT-4o-mini:  neg_confusion (conf=0.90) [1.14s] ✓
+  Agreement: Yes
+
+--- Test 5/12: pos_praise - explicit thanks ---
+Expected: pos_praise
+  Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
+  GPT-4o-mini:  pos_praise (conf=0.95) [1.32s] ✓
+  Agreement: Yes
+
+--- Test 6/12: pos_praise - great explanation ---
+Expected: pos_praise
+  Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
+  GPT-4o-mini:  pos_praise (conf=0.95) [1.02s] ✓
+  Agreement: Yes
+
+--- Test 7/12: pos_progress - follow-up question ---
+Expected: pos_progress
+  Local (Llama): pos_progress (conf=0.90) [1.35s] ✓
+  GPT-4o-mini:  pos_progress (conf=0.90) [1.15s] ✓
+  Agreement: Yes
+
+--- Test 8/12: pos_progress - extension ---
+Expected: pos_progress
+  Local (Llama): pos_progress (conf=0.90) [1.33s] ✓
+  GPT-4o-mini:  pos_progress (conf=0.90) [1.25s] ✓
+  Agreement: Yes
+
+--- Test 9/12: neutral - minimal response ---
+Expected: neutral
+  Local (Llama): neutral (conf=0.80) [1.19s] ✓
+  GPT-4o-mini:  neutral (conf=0.90) [1.24s] ✓
+  Agreement: Yes
+
+--- Test 10/12: topic_shift - new topic ---
+Expected: topic_shift
+  Local (Llama): topic_shift (conf=0.90) [1.21s] ✓
+  GPT-4o-mini:  topic_shift (conf=0.90) [1.61s] ✓
+  Agreement: Yes
+
+--- Test 11/12: neg_constraint_restate - language preference ---
+Expected: neg_constraint_restate
+  Local (Llama): neg_constraint_restate (conf=0.80) [1.38s] ✓
+  GPT-4o-mini:  neg_constraint_restate (conf=0.90) [2.55s] ✓
+  Agreement: Yes
+
+--- Test 12/12: neg_correction - incomplete answer ---
+Expected: neg_correction
+  Local (Llama): neg_correction (conf=0.80) [1.00s] ✓
+  GPT-4o-mini:  neg_correction (conf=0.90) [2.35s] ✓
+  Agreement: Yes
+
+================================================================================
+SUMMARY
+================================================================================
+Local (Llama-3.1-8B) Accuracy: 91.7% (11/12)
+GPT-4o-mini Accuracy:          91.7% (11/12)
+Agreement Rate:                100.0%
+
+Local Avg Time: 1.19s
+GPT Avg Time:   1.44s
+Speedup:        1.2x faster (local)
+
+Local Model Errors (1):
+  - neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate
+
+GPT Model Errors (1):
+  - neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate
+
+================================================================================
+
+=== Test Complete ===