From da40fccaa2176349482581bb0f7fb610e168f1b5 Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Wed, 11 Feb 2026 01:13:48 +0000
Subject: Add all-methods comparison table with dataset caveat

- Add vanilla/contextual/all_memory from fullscale (different datasets)
- Flag dataset mismatch (5-dataset vs 3-dataset)
- Queue matching-dataset vanilla/contextual experiment
- Same-dataset comparison: rag_vector leads in success (54.2%)

Co-Authored-By: Claude Opus 4.6
---
 notes.md | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

(limited to 'notes.md')

diff --git a/notes.md b/notes.md
index 919633c..15f6542 100644
--- a/notes.md
+++ b/notes.md
@@ -432,12 +432,28 @@ Rewrite draft: "(3, π/2)"
 | Repetition bug | 254 (7.1%) | 138 (3.8%) |
 | JSON leak | ~0 | ~0 |
 
-### Metrics comparison after cleanup
+### Metrics comparison after cleanup (all methods)
+
+**⚠️ Dataset mismatch**: vanilla/contextual/all_memory use `math-hard,humaneval,mmlu,bigcodebench,aime` (which includes the easier mmlu/humaneval), while rag/reflection/rag_vector use `math-hard,math-500,bigcodebench`. Matching-dataset runs of vanilla/contextual are queued (queue_baselines_60s.sh).
+
+| Method | Success | Timeout | E/T | User Tokens | Dataset |
+|--------|---------|---------|-----|-------------|---------|
+| vanilla† | 66.9% | 28.1% | 0.377 | 256.8 | 5-dataset |
+| contextual† | 65.3% | 29.4% | 0.407 | 237.2 | 5-dataset |
+| all_memory† | 61.8% | 33.1% | 0.374 | 248.6 | 5-dataset |
+| rag | 52.0% | 44.3% | 0.402 | 188.4 | 3-dataset |
+| reflection | 53.6% | 42.9% | 0.374 | 205.8 | 3-dataset |
+| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** | 3-dataset |
+
+†Different datasets, so the values are not directly comparable. The 3-dataset set (math-hard,math-500,bigcodebench) is harder.
+
+**Same-dataset comparison (rag/reflection/rag_vector)**:
 
 | Method | Success | Timeout | E/T | User Tokens |
 |--------|---------|---------|-----|-------------|
-| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
-| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |
+| rag | 52.0% | 44.3% | 0.402 | 188.4 |
+| reflection | 53.6% | 42.9% | 0.374 | 205.8 |
+| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** |
 
 ### Significance test (paired t-test, one-sided, N=60 profiles)
 
-- 
cgit v1.2.3