Fix all-methods comparison: use same experiment run (fullscale 200p60s)

- Fullscale has all 6 methods in one run (fair comparison)
- Note: fullscale rag_vector is old version (no query_transform/global_prefs)
- Separate section for 60s experiments (new rag_vector, harder datasets)
- reflection dominates in fullscale (66.9%), rag/rag_vector ~52-53%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
author: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 01:31:45 +0000
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 01:31:45 +0000
commit: 1b198877233afd9ffa0093f0c4e9c2959c2589e9 (patch)
tree: 374c0da5c8d72c0faec3fce95dd2b6ebf9a17ff3
parent: da40fccaa2176349482581bb0f7fb610e168f1b5 (diff)
1 files changed, 29 insertions, 13 deletions
diff --git a/notes.md b/notes.md
index 15f6542..f4358b1 100644
--- a/notes.md
+++ b/notes.md
@@ -432,28 +432,44 @@ Rewrite draft: "(3, π/2)"
 | Repetition bug | 254 (7.1%) | 138 (3.8%) |
 | JSON leak | ~0 | ~0 |
 
-### Post-cleanup metric comparison (all methods)
+### Fullscale all-methods comparison (same experiment run, 02/01)
 
-**⚠️ Dataset mismatch**: vanilla/contextual/all_memory use `math-hard,humaneval,mmlu,bigcodebench,aime` (which includes the easier mmlu/humaneval), while rag/reflection/rag_vector use `math-hard,math-500,bigcodebench`. vanilla/contextual runs on the matching datasets are queued (queue_baselines_60s.sh).
+Source: `fullscale_200p60s`, with all 6 methods in the same run for a fair comparison
+Dataset: math-hard, humaneval, mmlu, bigcodebench, aime (5 datasets)
+⚠️ rag_vector in this run is the **old version** (without the query_transform/global_preferences optimizations)
 
-| Method | Success | Timeout | E/T | User Tokens | Dataset |
-|--------|---------|---------|-----|-------------|---------|
-| vanilla† | 66.9% | 28.1% | 0.377 | 256.8 | 5-dataset |
-| contextual† | 65.3% | 29.4% | 0.407 | 237.2 | 5-dataset |
-| all_memory† | 61.8% | 33.1% | 0.374 | 248.6 | 5-dataset |
-| rag | 52.0% | 44.3% | 0.402 | 188.4 | 3-dataset |
-| reflection | 53.6% | 42.9% | 0.374 | 205.8 | 3-dataset |
-| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** | 3-dataset |
+**200 profiles × 60 sessions (N=12000/method)**:
 
-†Different datasets, so the numbers are not directly comparable. The 3-dataset set (math-hard,math-500,bigcodebench) is harder.
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| vanilla | 66.8% | 29.0% | 0.375 | 260.8 |
+| contextual | 64.5% | 31.2% | 0.403 | 239.3 |
+| all_memory | 62.6% | 33.2% | 0.371 | 254.0 |
+| **reflection** | **66.9%** | **28.6%** | **0.370** | 232.4 |
+| rag_vector (old) | 53.0% | 43.7% | 0.380 | 206.1 |
+| rag | 52.1% | 44.9% | 0.375 | 209.7 |
+
+**Profiles 0-59 subset**:
+
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| vanilla | 66.9% | 28.1% | 0.377 | 256.8 |
+| contextual | 65.3% | 29.4% | 0.407 | 237.2 |
+| all_memory | 61.8% | 33.1% | 0.374 | 248.6 |
+| **reflection** | **67.1%** | **28.1%** | **0.370** | 235.0 |
+| rag_vector (old) | 53.4% | 42.5% | 0.385 | 197.2 |
+| rag | 51.8% | 44.6% | 0.375 | 216.1 |
+
+### 60s experiment comparison (02/09, new rag_vector)
 
-**Same-dataset comparison (rag/reflection/rag_vector)**:
+Source: separately run 60s experiments; datasets: math-hard, math-500, bigcodebench (3 datasets, harder)
+rag_vector version: with the query_transform + global_preferences optimizations
 
 | Method | Success | Timeout | E/T | User Tokens |
 |--------|---------|---------|-----|-------------|
 | rag | 52.0% | 44.3% | 0.402 | 188.4 |
 | reflection | 53.6% | 42.9% | 0.374 | 205.8 |
-| **rag_vector** | **54.2%** | **42.7%** | 0.400 | **191.7** |
+| **rag_vector (new)** | **54.2%** | **42.7%** | 0.400 | **191.7** |
 
 ### Significance tests (paired t-test, one-sided, N=60 profiles)
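
Note on the last heading in the hunk above: it names a one-sided paired t-test over N=60 profiles, but the diff does not show how it is computed. The following is only a minimal sketch of the test statistic, assuming each method has one success rate per profile and that profiles are matched by index. The function name `paired_t_statistic` and its input layout are hypothetical and are not taken from the repository.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(treatment, baseline):
    """Return (t, df) for a paired t-test on matched per-profile scores.

    Hypothetical helper: treatment[i] and baseline[i] are assumed to be
    the scores of two methods on the same profile i.
    """
    if len(treatment) != len(baseline):
        raise ValueError("paired samples must have the same length")
    # Work on per-profile differences; pairing removes between-profile variance.
    diffs = [a - b for a, b in zip(treatment, baseline)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

For the one-sided alternative "treatment > baseline", the p-value is the upper tail of a t distribution with `df` degrees of freedom evaluated at `t`. If SciPy is available, `scipy.stats.ttest_rel(treatment, baseline, alternative="greater")` should give the statistic and p-value in one call.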