| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-01-27 12:15:45 -0600 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-01-27 12:15:45 -0600 |
| commit | 680513b7771a29f27cbbb3ffb009a69a913de6f9 (patch) | |
| tree | a0d60aef9ade1b2953b915f535b990c0de95e493 /scripts/test_reward_cmp_15667126.out | |
| parent | c06ec2f3b80f8968f09eb801b69237495b055ec1 (diff) | |
local reward model
Diffstat (limited to 'scripts/test_reward_cmp_15667126.out')
| -rw-r--r-- | scripts/test_reward_cmp_15667126.out | 104 |
1 files changed, 104 insertions, 0 deletions
diff --git a/scripts/test_reward_cmp_15667126.out b/scripts/test_reward_cmp_15667126.out
new file mode 100644
index 0000000..1e10566
--- /dev/null
+++ b/scripts/test_reward_cmp_15667126.out
@@ -0,0 +1,104 @@
+=== Reward Model Comparison Test ===
+Local: Llama-3.1-8B-Instruct
+API: GPT-4o-mini
+
+================================================================================
+Reward Model Comparison: Llama-3.1-8B vs GPT-4o-mini
+================================================================================
+
+Loading models/llama-3.1-8b-instruct...
+Model loaded.
+Running 12 test cases...
+
+--- Test 1/12: neg_constraint_restate - format preference ---
+Expected: neg_constraint_restate
+ Local (Llama): neg_correction (conf=0.80) [1.48s] ✗
+ GPT-4o-mini: neg_correction (conf=0.90) [1.72s] ✗
+ Agreement: Yes
+
+--- Test 2/12: neg_constraint_restate - step by step ---
+Expected: neg_constraint_restate
+ Local (Llama): neg_constraint_restate (conf=0.90) [1.16s] ✓
+ GPT-4o-mini: neg_constraint_restate (conf=0.90) [0.95s] ✓
+ Agreement: Yes
+
+--- Test 3/12: neg_correction - wrong answer ---
+Expected: neg_correction
+ Local (Llama): neg_correction (conf=0.90) [1.03s] ✓
+ GPT-4o-mini: neg_correction (conf=0.90) [1.01s] ✓
+ Agreement: Yes
+
+--- Test 4/12: neg_confusion - unclear explanation ---
+Expected: neg_confusion
+ Local (Llama): neg_confusion (conf=0.80) [1.20s] ✓
+ GPT-4o-mini: neg_confusion (conf=0.90) [1.14s] ✓
+ Agreement: Yes
+
+--- Test 5/12: pos_praise - explicit thanks ---
+Expected: pos_praise
+ Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
+ GPT-4o-mini: pos_praise (conf=0.95) [1.32s] ✓
+ Agreement: Yes
+
+--- Test 6/12: pos_praise - great explanation ---
+Expected: pos_praise
+ Local (Llama): pos_praise (conf=1.00) [0.97s] ✓
+ GPT-4o-mini: pos_praise (conf=0.95) [1.02s] ✓
+ Agreement: Yes
+
+--- Test 7/12: pos_progress - follow-up question ---
+Expected: pos_progress
+ Local (Llama): pos_progress (conf=0.90) [1.35s] ✓
+ GPT-4o-mini: pos_progress (conf=0.90) [1.15s] ✓
+ Agreement: Yes
+
+--- Test 8/12: pos_progress - extension ---
+Expected: pos_progress
+ Local (Llama): pos_progress (conf=0.90) [1.33s] ✓
+ GPT-4o-mini: pos_progress (conf=0.90) [1.25s] ✓
+ Agreement: Yes
+
+--- Test 9/12: neutral - minimal response ---
+Expected: neutral
+ Local (Llama): neutral (conf=0.80) [1.19s] ✓
+ GPT-4o-mini: neutral (conf=0.90) [1.24s] ✓
+ Agreement: Yes
+
+--- Test 10/12: topic_shift - new topic ---
+Expected: topic_shift
+ Local (Llama): topic_shift (conf=0.90) [1.21s] ✓
+ GPT-4o-mini: topic_shift (conf=0.90) [1.61s] ✓
+ Agreement: Yes
+
+--- Test 11/12: neg_constraint_restate - language preference ---
+Expected: neg_constraint_restate
+ Local (Llama): neg_constraint_restate (conf=0.80) [1.38s] ✓
+ GPT-4o-mini: neg_constraint_restate (conf=0.90) [2.55s] ✓
+ Agreement: Yes
+
+--- Test 12/12: neg_correction - incomplete answer ---
+Expected: neg_correction
+ Local (Llama): neg_correction (conf=0.80) [1.00s] ✓
+ GPT-4o-mini: neg_correction (conf=0.90) [2.35s] ✓
+ Agreement: Yes
+
+================================================================================
+SUMMARY
+================================================================================
+Local (Llama-3.1-8B) Accuracy: 91.7% (11/12)
+GPT-4o-mini Accuracy: 91.7% (11/12)
+Agreement Rate: 100.0%
+
+Local Avg Time: 1.19s
+GPT Avg Time: 1.44s
+Speedup: 1.2x faster (local)
+
+Local Model Errors (1):
+ - neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate
+
+GPT Model Errors (1):
+ - neg_constraint_restate - format preference: Got neg_correction, Expected neg_constraint_restate
+
+================================================================================
+
+=== Test Complete ===
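The SUMMARY figures in the log can be cross-checked against the twelve per-test lines. The following is a minimal sketch, not the script that produced this output: the `RESULTS` tuples are transcribed by hand from the log, `summarize` is a hypothetical helper, and the speedup is assumed to be the GPT average time divided by the local average time.

```python
from statistics import mean

# (expected, local_label, local_secs, gpt_label, gpt_secs), transcribed from the log.
RESULTS = [
    ("neg_constraint_restate", "neg_correction", 1.48, "neg_correction", 1.72),
    ("neg_constraint_restate", "neg_constraint_restate", 1.16, "neg_constraint_restate", 0.95),
    ("neg_correction", "neg_correction", 1.03, "neg_correction", 1.01),
    ("neg_confusion", "neg_confusion", 1.20, "neg_confusion", 1.14),
    ("pos_praise", "pos_praise", 0.97, "pos_praise", 1.32),
    ("pos_praise", "pos_praise", 0.97, "pos_praise", 1.02),
    ("pos_progress", "pos_progress", 1.35, "pos_progress", 1.15),
    ("pos_progress", "pos_progress", 1.33, "pos_progress", 1.25),
    ("neutral", "neutral", 1.19, "neutral", 1.24),
    ("topic_shift", "topic_shift", 1.21, "topic_shift", 1.61),
    ("neg_constraint_restate", "neg_constraint_restate", 1.38, "neg_constraint_restate", 2.55),
    ("neg_correction", "neg_correction", 1.00, "neg_correction", 2.35),
]


def summarize(results):
    """Recompute accuracy, agreement, and timing from per-test results."""
    n = len(results)
    local_correct = sum(exp == loc for exp, loc, _, _, _ in results)
    gpt_correct = sum(exp == gpt for exp, _, _, gpt, _ in results)
    agreements = sum(loc == gpt for _, loc, _, gpt, _ in results)
    local_avg = mean(t for _, _, t, _, _ in results)
    gpt_avg = mean(t for _, _, _, _, t in results)
    return {
        "n": n,
        "local_correct": local_correct,
        "gpt_correct": gpt_correct,
        "local_accuracy_pct": 100.0 * local_correct / n,
        "gpt_accuracy_pct": 100.0 * gpt_correct / n,
        "agreement_pct": 100.0 * agreements / n,
        "local_avg_s": local_avg,
        "gpt_avg_s": gpt_avg,
        # Assumption: speedup is the ratio of average latencies (GPT / local).
        "speedup": gpt_avg / local_avg,
    }


if __name__ == "__main__":
    for key, value in summarize(RESULTS).items():
        print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

Under these assumptions the recomputed values round to the figures the log reports. Note that both models fail the same case (Test 1), which is why agreement is 100% even though neither accuracy is.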
