| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 01:13:48 +0000 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 01:13:48 +0000 |
| commit | da40fccaa2176349482581bb0f7fb610e168f1b5 (patch) | |
| tree | 6611ee689433e20268713ed7c53d55020246f813 | |
| parent | 08a08befe8ac0202ce952b88aa144cbffe2a73f9 (diff) | |
Add all-methods comparison table with dataset caveat
- Add vanilla/contextual/all_memory from fullscale (different datasets)
- Flag dataset mismatch (5-dataset vs 3-dataset)
- Queue matching-dataset vanilla/contextual experiment
- Same-dataset comparison: rag_vector leads in success (54.2%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| -rwxr-xr-x | collaborativeagents/scripts/queue_baselines_60s.sh | 46 |
| -rw-r--r-- | notes.md | 22 |
2 files changed, 65 insertions, 3 deletions
diff --git a/collaborativeagents/scripts/queue_baselines_60s.sh b/collaborativeagents/scripts/queue_baselines_60s.sh
new file mode 100755
index 0000000..0938ff4
--- /dev/null
+++ b/collaborativeagents/scripts/queue_baselines_60s.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+# Run vanilla and contextual baselines with matching datasets (math-hard, math-500, bigcodebench)
+# Same setup as rag/reflection/rag_vector 60s experiments for fair comparison
+
+echo "Starting baseline experiments (vanilla + contextual)..."
+echo "$(date '+%Y-%m-%d %H:%M:%S')"
+
+# Vanilla (no personalization)
+echo "Starting vanilla..."
+python collaborativeagents/scripts/run_experiments.py \
+    --methods vanilla \
+    --datasets math-hard,math-500,bigcodebench \
+    --n-profiles 60 \
+    --n-sessions 60 \
+    --max-turns 10 \
+    --use-vllm \
+    --vllm-agent-url http://localhost:8003/v1 \
+    --vllm-user-url http://localhost:8004/v1 \
+    --use-batch-processing \
+    --batch-size 4 \
+    --parallel-profiles 20 \
+    --profile-path collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl \
+    --output-dir collaborativeagents/results/vanilla_60s \
+    2>&1 | tee collaborativeagents/results/vanilla_60s.log
+
+echo "Vanilla completed at $(date '+%Y-%m-%d %H:%M:%S')"
+
+# Contextual (full history in context)
+echo "Starting contextual..."
+python collaborativeagents/scripts/run_experiments.py \
+    --methods contextual \
+    --datasets math-hard,math-500,bigcodebench \
+    --n-profiles 60 \
+    --n-sessions 60 \
+    --max-turns 10 \
+    --use-vllm \
+    --vllm-agent-url http://localhost:8003/v1 \
+    --vllm-user-url http://localhost:8004/v1 \
+    --use-batch-processing \
+    --batch-size 4 \
+    --parallel-profiles 20 \
+    --profile-path collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl \
+    --output-dir collaborativeagents/results/contextual_60s \
+    2>&1 | tee collaborativeagents/results/contextual_60s.log
+
+echo "Both baselines completed at $(date '+%Y-%m-%d %H:%M:%S')"
diff --git a/notes.md b/notes.md
--- a/notes.md
+++ b/notes.md
@@ -432,12 +432,28 @@ Rewrite draft: "(3, π/2)"
 | Repetition bug | 254 (7.1%) | 138 (3.8%) |
 | JSON leak | ~0 | ~0 |
 
-### Post-cleanup metric comparison
+### Post-cleanup metric comparison (all methods)
+
+**⚠️ Dataset mismatch**: vanilla/contextual/all_memory use `math-hard,humaneval,mmlu,bigcodebench,aime` (which includes the easier mmlu/humaneval), while rag/reflection/rag_vector use `math-hard,math-500,bigcodebench`. Matching-dataset vanilla/contextual runs are queued (queue_baselines_60s.sh).
+
+| Method | Success | Timeout | E/T | User Tokens | Dataset |
+|--------|---------|---------|-----|-------------|---------|
+| vanilla† | 66.9% | 28.1% | 0.377 | 256.8 | 5-dataset |
+| contextual† | 65.3% | 29.4% | 0.407 | 237.2 | 5-dataset |
+| all_memory† | 61.8% | 33.1% | 0.374 | 248.6 | 5-dataset |
+| rag | 52.0% | 44.3% | 0.402 | 188.4 | 3-dataset |
+| reflection | 53.6% | 42.9% | 0.374 | 205.8 | 3-dataset |
+| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** | 3-dataset |
+
+†Different datasets, so the numbers are not directly comparable. The 3-dataset set (math-hard,math-500,bigcodebench) is harder.
+
+**Same-dataset comparison (rag/reflection/rag_vector)**:
 
 | Method | Success | Timeout | E/T | User Tokens |
 |--------|---------|---------|-----|-------------|
-| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
-| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |
+| rag | 52.0% | 44.3% | 0.402 | 188.4 |
+| reflection | 53.6% | 42.9% | 0.374 | 205.8 |
+| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** |
 
 ### Significance tests (paired t-test, one-sided, N=60 profiles)
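
Review note on `queue_baselines_60s.sh`: the two `run_experiments.py` invocations differ only in the method name and the output paths, so they could be written as one loop. The following is a minimal sketch of that idea, not part of the commit. `run_baseline`, `DRY_RUN`, and `RESULTS_DIR` are hypothetical names introduced here for illustration; every flag and path is copied from the script as committed. The sketch defaults to printing the commands so they can be inspected before anything is queued.

```shell
#!/bin/bash
# Hypothetical refactor sketch: loop over the baselines instead of repeating the command.
# run_baseline, DRY_RUN and RESULTS_DIR are illustrative names, not part of the commit.

DRY_RUN="${DRY_RUN:-1}"   # default: only print the command; set DRY_RUN=0 to execute
RESULTS_DIR="collaborativeagents/results"

run_baseline() {
    method="$1"
    # Build the argument list once; only the method name and output dir vary.
    set -- python collaborativeagents/scripts/run_experiments.py \
        --methods "$method" \
        --datasets math-hard,math-500,bigcodebench \
        --n-profiles 60 \
        --n-sessions 60 \
        --max-turns 10 \
        --use-vllm \
        --vllm-agent-url http://localhost:8003/v1 \
        --vllm-user-url http://localhost:8004/v1 \
        --use-batch-processing \
        --batch-size 4 \
        --parallel-profiles 20 \
        --profile-path collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl \
        --output-dir "$RESULTS_DIR/${method}_60s"

    if [ "$DRY_RUN" = "1" ]; then
        # Print the command line instead of running it.
        echo "$*"
    else
        # Same logging as the committed script: merge stderr and tee to a per-method log.
        "$@" 2>&1 | tee "$RESULTS_DIR/${method}_60s.log"
    fi
}

for m in vanilla contextual; do
    echo "Starting $m..."
    run_baseline "$m"
    echo "$m completed at $(date '+%Y-%m-%d %H:%M:%S')"
done
```

Running it with `DRY_RUN=0` should behave like the committed script, assuming the vLLM endpoints on ports 8003 and 8004 are up. As in the original, the pipe into `tee` means a failing Python run does not stop the loop.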
