| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-01-27 12:15:45 -0600 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-01-27 12:15:45 -0600 |
| commit | 680513b7771a29f27cbbb3ffb009a69a913de6f9 (patch) | |
| tree | a0d60aef9ade1b2953b915f535b990c0de95e493 /scripts/test_reward_cmp_15667076.out | |
| parent | c06ec2f3b80f8968f09eb801b69237495b055ec1 (diff) | |
local reward model
Diffstat (limited to 'scripts/test_reward_cmp_15667076.out')
| -rw-r--r-- | scripts/test_reward_cmp_15667076.out | 126 |
1 files changed, 126 insertions, 0 deletions
diff --git a/scripts/test_reward_cmp_15667076.out b/scripts/test_reward_cmp_15667076.out
new file mode 100644
index 0000000..67693b1
--- /dev/null
+++ b/scripts/test_reward_cmp_15667076.out
@@ -0,0 +1,126 @@
+=== Reward Model Comparison Test ===
+Local: Qwen2.5-1.5B-Instruct
+API: GPT-5-nano
+
+================================================================================
+Reward Model Comparison: Qwen2.5-1.5B vs GPT-5-nano
+================================================================================
+
+Loading models/qwen2.5-1.5b-instruct...
+Model loaded.
+Running 12 test cases...
+
+--- Test 1/12: neg_constraint_restate - format preference ---
+Expected: neg_constraint_restate
+ Local (Qwen): neg_constraint_restate (conf=0.95) [2.42s] ✓
+ GPT-5-nano: neutral (conf=0.00) [4.97s] ✗
+ Agreement: NO
+
+--- Test 2/12: neg_constraint_restate - step by step ---
+Expected: neg_constraint_restate
+ Local (Qwen): neutral (conf=0.00) [1.04s] ✗
+ GPT-5-nano: neutral (conf=0.00) [4.43s] ✗
+ Agreement: Yes
+
+--- Test 3/12: neg_correction - wrong answer ---
+Expected: neg_correction
+ Local (Qwen): neutral (conf=0.00) [0.93s] ✗
+ GPT-5-nano: neutral (conf=0.00) [3.53s] ✗
+ Agreement: Yes
+
+--- Test 4/12: neg_confusion - unclear explanation ---
+Expected: neg_confusion
+ Local (Qwen): neutral (conf=0.00) [0.89s] ✗
+ GPT-5-nano: neutral (conf=0.00) [2.52s] ✗
+ Agreement: Yes
+
+--- Test 5/12: pos_praise - explicit thanks ---
+Expected: pos_praise
+ Local (Qwen): neutral (conf=0.00) [0.88s] ✗
+ GPT-5-nano: pos_praise (conf=0.92) [3.57s] ✓
+ Agreement: NO
+
+--- Test 6/12: pos_praise - great explanation ---
+Expected: pos_praise
+ Local (Qwen): neutral (conf=0.00) [0.92s] ✗
+ GPT-5-nano: pos_praise (conf=0.85) [2.19s] ✓
+ Agreement: NO
+
+--- Test 7/12: pos_progress - follow-up question ---
+Expected: pos_progress
+ Local (Qwen): neutral (conf=0.00) [0.95s] ✗
+ GPT-5-nano: neutral (conf=0.00) [3.76s] ✗
+ Agreement: Yes
+
+--- Test 8/12: pos_progress - extension ---
+Expected: pos_progress
+ Local (Qwen): neutral (conf=0.00) [1.20s] ✗
+ GPT-5-nano: neutral (conf=0.00) [3.37s] ✗
+ Agreement: Yes
+
+--- Test 9/12: neutral - minimal response ---
+Expected: neutral
+ Local (Qwen): neutral (conf=0.00) [0.78s] ✓
+ GPT-5-nano: neutral (conf=0.00) [4.21s] ✓
+ Agreement: Yes
+
+--- Test 10/12: topic_shift - new topic ---
+Expected: topic_shift
+ Local (Qwen): neutral (conf=0.00) [1.02s] ✗
+ GPT-5-nano: neutral (conf=0.00) [3.86s] ✗
+ Agreement: Yes
+
+--- Test 11/12: neg_constraint_restate - language preference ---
+Expected: neg_constraint_restate
+ Local (Qwen): neutral (conf=0.00) [0.94s] ✗
+ GPT-5-nano: neutral (conf=0.00) [2.31s] ✗
+ Agreement: Yes
+
+--- Test 12/12: neg_correction - incomplete answer ---
+Expected: neg_correction
+ Local (Qwen): neutral (conf=0.00) [0.90s] ✗
+ GPT-5-nano: neutral (conf=0.00) [4.49s] ✗
+ Agreement: Yes
+
+================================================================================
+SUMMARY
+================================================================================
+Local (Qwen2.5-1.5B) Accuracy: 16.7% (2/12)
+GPT-5-nano Accuracy: 25.0% (3/12)
+Agreement Rate: 75.0%
+
+Local Avg Time: 1.07s
+GPT Avg Time: 3.60s
+Speedup: 3.4x faster (local)
+
+Disagreements (3):
+ - neg_constraint_restate - format preference: Local=neg_constraint_restate, GPT=neutral, Expected=neg_constraint_restate
+ - pos_praise - explicit thanks: Local=neutral, GPT=pos_praise, Expected=pos_praise
+ - pos_praise - great explanation: Local=neutral, GPT=pos_praise, Expected=pos_praise
+
+Local Model Errors (10):
+ - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
+ - neg_correction - wrong answer: Got neutral, Expected neg_correction
+ - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
+ - pos_praise - explicit thanks: Got neutral, Expected pos_praise
+ - pos_praise - great explanation: Got neutral, Expected pos_praise
+ - pos_progress - follow-up question: Got neutral, Expected pos_progress
+ - pos_progress - extension: Got neutral, Expected pos_progress
+ - topic_shift - new topic: Got neutral, Expected topic_shift
+ - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
+ - neg_correction - incomplete answer: Got neutral, Expected neg_correction
+
+GPT Model Errors (9):
+ - neg_constraint_restate - format preference: Got neutral, Expected neg_constraint_restate
+ - neg_constraint_restate - step by step: Got neutral, Expected neg_constraint_restate
+ - neg_correction - wrong answer: Got neutral, Expected neg_correction
+ - neg_confusion - unclear explanation: Got neutral, Expected neg_confusion
+ - pos_progress - follow-up question: Got neutral, Expected pos_progress
+ - pos_progress - extension: Got neutral, Expected pos_progress
+ - topic_shift - new topic: Got neutral, Expected topic_shift
+ - neg_constraint_restate - language preference: Got neutral, Expected neg_constraint_restate
+ - neg_correction - incomplete answer: Got neutral, Expected neg_correction
+
+================================================================================
+
+=== Test Complete ===
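
The SUMMARY figures in the log can be cross-checked against the twelve per-test rows. The following is a minimal Python sketch, assuming the rows are transcribed by hand from the log; `RESULTS` and `summarize` are hypothetical names for illustration and are not taken from the committed test script, whose source is not shown in this diff.

```python
# Rows transcribed from the log above (an assumption: hand-copied, not parsed).
# Each row: (expected, local_label, local_seconds, gpt_label, gpt_seconds)
RESULTS = [
    ("neg_constraint_restate", "neg_constraint_restate", 2.42, "neutral", 4.97),
    ("neg_constraint_restate", "neutral", 1.04, "neutral", 4.43),
    ("neg_correction", "neutral", 0.93, "neutral", 3.53),
    ("neg_confusion", "neutral", 0.89, "neutral", 2.52),
    ("pos_praise", "neutral", 0.88, "pos_praise", 3.57),
    ("pos_praise", "neutral", 0.92, "pos_praise", 2.19),
    ("pos_progress", "neutral", 0.95, "neutral", 3.76),
    ("pos_progress", "neutral", 1.20, "neutral", 3.37),
    ("neutral", "neutral", 0.78, "neutral", 4.21),
    ("topic_shift", "neutral", 1.02, "neutral", 3.86),
    ("neg_constraint_restate", "neutral", 0.94, "neutral", 2.31),
    ("neg_correction", "neutral", 0.90, "neutral", 4.49),
]


def summarize(results):
    """Recompute accuracy, agreement, and timing from per-test rows."""
    n = len(results)
    local_correct = sum(exp == loc for exp, loc, _, _, _ in results)
    gpt_correct = sum(exp == gpt for exp, _, _, gpt, _ in results)
    agree = sum(loc == gpt for _, loc, _, gpt, _ in results)
    local_avg = sum(row[2] for row in results) / n
    gpt_avg = sum(row[4] for row in results) / n
    return {
        "n": n,
        "local_correct": local_correct,
        "gpt_correct": gpt_correct,
        "agree": agree,
        "local_avg": local_avg,
        "gpt_avg": gpt_avg,
        "speedup": gpt_avg / local_avg,  # how many times faster the local model is
    }


if __name__ == "__main__":
    s = summarize(RESULTS)
    print(f"Local accuracy: {s['local_correct'] / s['n']:.1%} ({s['local_correct']}/{s['n']})")
    print(f"GPT accuracy:   {s['gpt_correct'] / s['n']:.1%} ({s['gpt_correct']}/{s['n']})")
    print(f"Agreement rate: {s['agree'] / s['n']:.1%}")
    print(f"Avg time: local {s['local_avg']:.2f}s, GPT {s['gpt_avg']:.2f}s")
    print(f"Speedup: {s['speedup']:.1f}x")
```

Note that agreement is counted between the two models' labels regardless of correctness, so the 75.0% rate here mostly reflects both models returning `neutral` on the same cases.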
