=== Reward Model Comparison Test ===
Local: Llama-3.1-8B-Instruct
API: GPT-4o-mini
================================================================================
Reward Model Comparison: Llama-3.1-8B vs GPT-4o-mini
================================================================================
Loading models/llama-3.1-8b-instruct...
Model loaded.
Running 12 test cases...
--- Test 1/12: neg_constraint_restate - format preference ---
Expected: neg_constraint_restate
Local (Llama): neg_correction (conf=0.80) [1.48s] ✗
GPT-4o-mini: neg_correction (conf=0.90) [1.72s] ✗
Agreement: Yes
--- Test 2/12: neg_constraint_restate - step by step ---
Expected: neg_constraint_restate
Local (Llama): neg_constraint_restate (conf=0.90) [1.16s] ✓
GPT-4o-mini: neg_constraint_restate (conf=0.90) [0.95s] ✓
Agreement: Yes
--- Test 3/12: neg_correction - wrong answer ---
Expected: neg_correction
Local (Llama): neg_correction (conf=0.90) [1.03s] ✓
GPT-4o-mini: neg_correction (conf=0.90) [1.01s] ✓
Agreement: Yes
--- Test 4/12: neg_confusion - unclear explanation ---
Expected: neg_confusion
Local (Llama): neg_confusion (conf=0.80) [1.20s] ✓
GPT-4o-mini: neg_confusion (conf=0.90) [1.14s] ✓
Agreement: Yes
--- Test 5/12: pos_praise - explicit thanks ---
Expected: pos_praise
Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
GPT-4o-mini: pos_praise (conf=0.95) [1.32s] ✓
Agreement: Yes
--- Test 6/12: pos_praise - great explanation ---
Expected: pos_praise
Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
GPT-4o-mini: pos_praise (conf=0.95) [1.02s] ✓
Agreement: Yes
--- Test 7/12: pos_progress - follow-up question ---
Expected: pos_progress
Local (Llama): pos_progress (conf=0.90) [1.35s] ✓
GPT-4o-mini: pos_progress (conf=0.90) [1.15s] ✓
Agreement: Yes
--- Test 8/12: pos_progress - extension ---
Expected: pos_progress
Local (Llama): pos_progress (conf=0.90) [1.33s] ✓
GPT-4o-mini: pos_progress (conf=0.90) [1.25s] ✓
Agreement: Yes
--- Test 9/12: neutral - minimal response ---
Expected: neutral
Local (Llama): neutral (conf=0.80) [1.19s] ✓
GPT-4o-mini: neutral (conf=0.90) [1.24s] ✓
Agreement: Yes
--- Test 10/12: topic_shift - new topic ---
Expected: topic_shift
Local (Llama): topic_shift (conf=0.90) [1.21s] ✓
GPT-4o-mini: topic_shift (conf=0.90) [1.61s] ✓
Agreement: Yes
--- Test 11/12: neg_constraint_restate - language preference ---
Expected: neg_constraint_restate
Local (Llama): neg_constraint_restate (conf=0.80) [1.38s] ✓
GPT-4o-mini: neg_constraint_restate (conf=0.90) [2.55s] ✓
Agreement: Yes
--- Test 12/12: neg_correction - incomplete answer ---
Expected: neg_correction
Local (Llama): neg_correction (conf=0.80) [1.00s] ✓
GPT-4o-mini: neg_correction (conf=0.90) [2.35s] ✓
Agreement: Yes
================================================================================
SUMMARY
================================================================================
Local (Llama-3.1-8B) Accuracy: 91.7% (11/12)
GPT-4o-mini Accuracy: 91.7% (11/12)
Agreement Rate: 100.0%
Local Avg Time: 1.19s
GPT Avg Time: 1.44s
Speedup: 1.2x faster (local)
Local Model Errors (1):
- neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate
GPT Model Errors (1):
- neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate
================================================================================
=== Test Complete ===