From d88d4899b73b586a0fd2e165f9e7661d0723d65a Mon Sep 17 00:00:00 2001
From: YurenHao0426 <blackhao0426@gmail.com>
Date: Wed, 11 Feb 2026 01:34:23 +0000
Subject: Restore original stats, add fullscale 6-method data as separate
 section

- Reverted the cleaned comparison table to original (report figures depend on it)
- Added fullscale 200p60s all-method comparison as supplementary data at end
- Noted that fullscale rag/rag_vector is old version (no retrieval optimizations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 notes.md | 68 +++++++++++++++++++++++++++++++---------------------------------
 1 file changed, 33 insertions(+), 35 deletions(-)

(limited to 'notes.md')

diff --git a/notes.md b/notes.md
index f4358b1..20ddfa8 100644
--- a/notes.md
+++ b/notes.md
@@ -432,44 +432,12 @@ Rewrite draft: "(3, π/2)"
 | Repetition bug | 254 (7.1%) | 138 (3.8%) |
 | JSON leak | ~0 | ~0 |
 
-### Fullscale 全方法对比 (同一实验run, 02/01)
-
-来源: `fullscale_200p60s` — 6方法同时跑，公平对比
-Dataset: math-hard, humaneval, mmlu, bigcodebench, aime (5个)
-⚠️ 此run的rag_vector是**旧版**（无query_transform/global_preferences优化）
-
-**200 profiles × 60 sessions (N=12000/method)**:
-
-| Method | Success | Timeout | E/T | User Tokens |
-|--------|---------|---------|-----|-------------|
-| vanilla | 66.8% | 29.0% | 0.375 | 260.8 |
-| contextual | 64.5% | 31.2% | 0.403 | 239.3 |
-| all_memory | 62.6% | 33.2% | 0.371 | 254.0 |
-| **reflection** | **66.9%** | **28.6%** | **0.370** | 232.4 |
-| rag_vector (旧) | 53.0% | 43.7% | 0.380 | 206.1 |
-| rag | 52.1% | 44.9% | 0.375 | 209.7 |
-
-**Profiles 0-59 子集**:
-
-| Method | Success | Timeout | E/T | User Tokens |
-|--------|---------|---------|-----|-------------|
-| vanilla | 66.9% | 28.1% | 0.377 | 256.8 |
-| contextual | 65.3% | 29.4% | 0.407 | 237.2 |
-| all_memory | 61.8% | 33.1% | 0.374 | 248.6 |
-| **reflection** | **67.1%** | **28.1%** | **0.370** | 235.0 |
-| rag_vector (旧) | 53.4% | 42.5% | 0.385 | 197.2 |
-| rag | 51.8% | 44.6% | 0.375 | 216.1 |
-
-### 60s实验对比 (02/09, 新版rag_vector)
-
-来源: 分别跑的60s实验，dataset: math-hard, math-500, bigcodebench (3个，更难)
-rag_vector版本: 有query_transform + global_preferences优化
+### 清理后指标对比
 
 | Method | Success | Timeout | E/T | User Tokens |
 |--------|---------|---------|-----|-------------|
-| rag | 52.0% | 44.3% | 0.402 | 188.4 |
-| reflection | 53.6% | 42.9% | 0.374 | 205.8 |
-| **rag_vector (新)** | **54.2%** | **42.7%** | 0.400 | **191.7** |
+| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
+| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |
 
 ### 显著性检验 (paired t-test, one-sided, N=60 profiles)
 
@@ -607,6 +575,36 @@ E/T偏高可解释为: retrieval方法surface更多specific preferences，导致
 
 ---
 
+## Fullscale 全方法对比 (补充数据, 02/01实验)
+
+来源: `fullscale_200p60s/20260201_061907` — 6方法同一run
+Dataset: math-hard, humaneval, mmlu, bigcodebench, aime (5个)
+注意: 此run的rag/rag_vector是旧版（无query_transform/global_preferences），与报告用的60s实验不同
+
+**200 profiles × 60 sessions (re-graded, bug-cleaned)**:
+
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| vanilla | 66.8% | 29.0% | 0.375 | 260.8 |
+| reflection | 66.9% | 28.6% | 0.370 | 232.4 |
+| contextual | 64.5% | 31.2% | 0.403 | 239.3 |
+| all_memory | 62.6% | 33.2% | 0.371 | 254.0 |
+| rag_vector (旧版) | 53.0% | 43.7% | 0.380 | 206.1 |
+| rag (旧版) | 52.1% | 44.9% | 0.375 | 209.7 |
+
+**Profiles 0-59 子集**:
+
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| vanilla | 66.9% | 28.1% | 0.377 | 256.8 |
+| reflection | 67.1% | 28.1% | 0.370 | 235.0 |
+| contextual | 65.3% | 29.4% | 0.407 | 237.2 |
+| all_memory | 61.8% | 33.1% | 0.374 | 248.6 |
+| rag_vector (旧版) | 53.4% | 42.5% | 0.385 | 197.2 |
+| rag (旧版) | 51.8% | 44.6% | 0.375 | 216.1 |
+
+---
+
 ## 文件位置
 
 - 实验结果: `collaborativeagents/results/`
-- 
cgit v1.2.3