scripts/test_reward_cmp_15667076.out



=== Reward Model Comparison Test ===
Local: Qwen2.5-1.5B-Instruct
API: GPT-5-nano

================================================================================
Reward Model Comparison: Qwen2.5-1.5B vs GPT-5-nano
================================================================================

Loading models/qwen2.5-1.5b-instruct...
Model loaded.
Running 12 test cases...

--- Test 1/12: neg_constraint_restate - format preference ---
Expected: neg_constraint_restate
  Local (Qwen): neg_constraint_restate (conf=0.95) [2.42s] ✓
  GPT-5-nano:   neutral (conf=0.00) [4.97s] ✗
  Agreement: NO

--- Test 2/12: neg_constraint_restate - step by step ---
Expected: neg_constraint_restate
  Local (Qwen): neutral (conf=0.00) [1.04s] ✗
  GPT-5-nano:   neutral (conf=0.00) [4.43s] ✗
  Agreement: Yes

--- Test 3/12: neg_correction - wrong answer ---
Expected: neg_correction
  Local (Qwen): neutral (conf=0.00) [0.93s] ✗
  GPT-5-nano:   neutral (conf=0.00) [3.53s] ✗
  Agreement: Yes

--- Test 4/12: neg_confusion - unclear explanation ---
Expected: neg_confusion
  Local (Qwen): neutral (conf=0.00) [0.89s] ✗
  GPT-5-nano:   neutral (conf=0.00) [2.52s] ✗
  Agreement: Yes

--- Test 5/12: pos_praise - explicit thanks ---
Expected: pos_praise
  Local (Qwen): neutral (conf=0.00) [0.88s] ✗
  GPT-5-nano:   pos_praise (conf=0.92) [3.57s] ✓
  Agreement: NO

--- Test 6/12: pos_praise - great explanation ---
Expected: pos_praise
  Local (Qwen): neutral (conf=0.00) [0.92s] ✗
  GPT-5-nano:   pos_praise (conf=0.85) [2.19s] ✓
  Agreement: NO

--- Test 7/12: pos_progress - follow-up question ---
Expected: pos_progress
  Local (Qwen): neutral (conf=0.00) [0.95s] ✗
  GPT-5-nano:   neutral (conf=0.00) [3.76s] ✗
  Agreement: Yes

--- Test 8/12: pos_progress - extension ---
Expected: pos_progress
  Local (Qwen): neutral (conf=0.00) [1.20s] ✗
  GPT-5-nano:   neutral (conf=0.00) [3.37s] ✗
  Agreement: Yes

--- Test 9/12: neutral - minimal response ---
Expected: neutral
  Local (Qwen): neutral (conf=0.00) [0.78s] ✓
  GPT-5-nano:   neutral (conf=0.00) [4.21s] ✓
  Agreement: Yes

--- Test 10/12: topic_shift - new topic ---
Expected: topic_shift
  Local (Qwen): neutral (conf=0.00) [1.02s] ✗
  GPT-5-nano:   neutral (conf=0.00) [3.86s] ✗
  Agreement: Yes

--- Test 11/12: neg_constraint_restate - language preference ---
Expected: neg_constraint_restate
  Local (Qwen): neutral (conf=0.00) [0.94s] ✗
  GPT-5-nano:   neutral (conf=0.00) [2.31s] ✗
  Agreement: Yes

--- Test 12/12: neg_correction - incomplete answer ---
Expected: neg_correction
  Local (Qwen): neutral (conf=0.00) [0.90s] ✗
  GPT-5-nano:   neutral (conf=0.00) [4.49s] ✗
  Agreement: Yes

================================================================================
SUMMARY
================================================================================
Local (Qwen2.5-1.5B) Accuracy: 16.7% (2/12)
GPT-5-nano Accuracy:           25.0% (3/12)
Agreement Rate:                75.0%

Local Avg Time: 1.07s
GPT Avg Time:   3.60s
Speedup:        3.4x faster (local)

Disagreements (3):
  - neg_constraint_restate - format preference: Local=neg_constraint_restate, GPT=neutral, Expected=neg_constraint_restate
  - pos_praise - explicit thanks: Local=neutral, GPT=pos_praise, Expected=pos_praise
  - pos_praise - great explanation: Local=neutral, GPT=pos_praise, Expected=pos_praise

Local Model Errors (10):
  - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
  - neg_correction - wrong answer: Got neutral, Expected neg_correction
  - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
  - pos_praise - explicit thanks: Got neutral, Expected pos_praise
  - pos_praise - great explanation: Got neutral, Expected pos_praise
  - pos_progress - follow-up question: Got neutral, Expected pos_progress
  - pos_progress - extension: Got neutral, Expected pos_progress
  - topic_shift - new topic: Got neutral, Expected topic_shift
  - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
  - neg_correction - incomplete answer: Got neutral, Expected neg_correction

GPT Model Errors (9):
  - neg_constraint_restate - format preference: Got neutral, Expected neg_constraint_restate
  - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
  - neg_correction - wrong answer: Got neutral, Expected neg_correction
  - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
  - pos_progress - follow-up question: Got neutral, Expected pos_progress
  - pos_progress - extension: Got neutral, Expected pos_progress
  - topic_shift - new topic: Got neutral, Expected topic_shift
  - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
  - neg_correction - incomplete answer: Got neutral, Expected neg_correction

================================================================================

=== Test Complete ===