Diffstat (limited to 'notes.md')
| -rw-r--r-- | notes.md | 22 |
1 files changed, 19 insertions, 3 deletions
@@ -432,12 +432,28 @@ Rewrite draft: "(3, π/2)"
 | Repetition bug | 254 (7.1%) | 138 (3.8%) |
 | JSON leak | ~0 | ~0 |
 
-### Metrics comparison after cleanup
+### Metrics comparison after cleanup (all methods)
+
+**⚠️ Dataset mismatch**: vanilla/contextual/all_memory use `math-hard,humaneval,mmlu,bigcodebench,aime` (which includes the easier mmlu/humaneval); rag/reflection/rag_vector use `math-hard,math-500,bigcodebench`. vanilla/contextual runs on the matching datasets are queued (queue_baselines_60s.sh).
+
+| Method | Success | Timeout | E/T | User Tokens | Dataset |
+|--------|---------|---------|-----|-------------|---------|
+| vanilla† | 66.9% | 28.1% | 0.377 | 256.8 | 5-dataset |
+| contextual† | 65.3% | 29.4% | 0.407 | 237.2 | 5-dataset |
+| all_memory† | 61.8% | 33.1% | 0.374 | 248.6 | 5-dataset |
+| rag | 52.0% | 44.3% | 0.402 | 188.4 | 3-dataset |
+| reflection | 53.6% | 42.9% | 0.374 | 205.8 | 3-dataset |
+| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** | 3-dataset |
+
+†Different datasets; values are not directly comparable. The 3-dataset set (math-hard,math-500,bigcodebench) is harder.
+
+**Same-dataset comparison (rag/reflection/rag_vector)**:
 
 | Method | Success | Timeout | E/T | User Tokens |
 |--------|---------|---------|-----|-------------|
-| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
-| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |
+| rag | 52.0% | 44.3% | 0.402 | 188.4 |
+| reflection | 53.6% | 42.9% | 0.374 | 205.8 |
+| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** |
 
 ### Significance tests (paired t-test, one-sided, N=60 profiles)
 
