Diffstat (limited to 'notes.md')
-rw-r--r-- notes.md 385
1 files changed, 385 insertions, 0 deletions
diff --git a/notes.md b/notes.md
new file mode 100644
index 0000000..be4e7de
--- /dev/null
+++ b/notes.md
@@ -0,0 +1,385 @@
+# Personalization Experiments Notes
+
+## Completed Experiment Results
+
+### 30 Sessions (60 profiles)
+
+| Method | Success | E/T | Early (0-9) | Late (20-29) | Learning slope |
+|------|---------|-----|------------|-------------|----------|
+| reflection_v2 | 43.3% | 0.179 | 36.7% | 56.5% | +19.8% |
+| rag_topk5_v3 | 41.3% | 0.188 | 31.2% | 58.0% | +26.8% |
+| **rag_vector_v3** | **43.8%** | 0.188 | 38.5% | **59.2%** | +20.7% |
+
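+**Key findings**:
+- rag_vector has the highest overall success rate (43.8% vs 43.3% for reflection)
+- Late sessions (20-29): rag_vector 59.2% > reflection 56.5%
+- The RAG variants learn faster: slope +26.8% for rag_topk5 and +20.7% for rag_vector vs +19.8% for reflection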
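+
+The slope values in the table above are consistent with a simple difference between the late-window and early-window success rates. This is an inference from the numbers, not a definition stated in these notes. Worked through for reflection_v2 and rag_topk5_v3:
+
+$$
+\text{slope} = \text{Success}_{\text{late}} - \text{Success}_{\text{early}}, \qquad
+56.5\% - 36.7\% = 19.8\ \text{points}, \qquad
+58.0\% - 31.2\% = 26.8\ \text{points}
+$$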
+
+### 60 Sessions (60 profiles) - ✅ Completed 02/10 00:55
+
+**Final results**:
+
+| Method | Success | Timeout | E/T | Early (0-9) | Late (50-59) | Slope |
+|------|---------|---------|-----|------------|-------------|-------|
+| reflection_60s | **40.3%** | 29.5% | **0.179** | **38.0%** | 43.5% | +5.5% |
+| rag_vector_60s | 37.1% | 27.3% | 0.192 | 35.0% | 40.0% | +5.0% |
+| rag_60s | 36.4% | **25.8%** | 0.192 | 33.8% | **42.5%** | **+8.7%** |
+
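+**Key observations**:
+1. **rag_vector > rag** (37.1% vs 36.4%): the user vector adds 0.7 percentage points
+2. **reflection still leads** (40.3%), with a 3.1-point gap
+3. **RAG variants time out less often** (25.8-27.3% vs 29.5%)
+4. **rag has the steepest learning slope** (+8.7%)
+5. **Head-to-head**: rag_vector wins 733 cases vs 708 for rag (net +25)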
+
+**By dataset**:
+| Dataset | reflection | rag | rag_vector |
+|--------|------------|-----|------------|
+| math | 35.4% | 30.7% | 30.9% |
+| bigcodebench | 48.7% | 46.1% | 47.5% |
+
+### Full Scale (200 profiles × 60 sessions) - old version
+
+| Method | Success | Code | Math |
+|------|---------|------|------|
+| vanilla | 52.8% | 73.1% | 39.2% |
+| reflection | 54.0% | 73.2% | 41.2% |
+| rag | 40.2% | 56.4% | 29.4% |
+| rag_vector | 40.6% | 56.3% | 30.2% |
+
+**Note**: The full-scale run used the old code, without the query_transform and global_preferences optimizations
+
+---
+
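+## Vector-Preference Correlation Analysis
+
+### 30-session experiment (rag_vector_v3)
+
+**Vector statistics**:
+- 59/60 users have a non-zero vector
+- Norm: mean=0.0023, max=0.0051
+- Vector norm grows with sessions: 0.00086 (s10) → 0.00227 (s30)
+
+**Jaccard(revealed_prefs) vs vector similarity**:
+- Spearman r = 0.09 (p=0.015), **statistically significant** but weak
+- The trend points the right way: more shared preferences → higher vector similarity
+
+| Jaccard range | User pairs | Mean vector similarity |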
+|-------------|---------|------------------|
+| 0.00-0.05 | 413 | -0.007 |
+| 0.05-0.10 | 215 | 0.034 |
+| 0.10-0.15 | 43 | 0.037 |
+| 0.15-0.25 | 10 | 0.072 |
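+
+For reference, the standard definitions behind this comparison, written for two users $u$ and $v$ with revealed preference sets $P_u$, $P_v$ and learned vectors $z_u$, $z_v$. The symbols are introduced here for illustration, and cosine similarity is an assumption: the notes do not say which vector similarity was used.
+
+$$
+J(u, v) = \frac{|P_u \cap P_v|}{|P_u \cup P_v|}, \qquad
+\operatorname{sim}(u, v) = \frac{z_u \cdot z_v}{\lVert z_u \rVert \, \lVert z_v \rVert}
+$$
+
+Spearman's r is then the rank correlation between $J$ and $\operatorname{sim}$ across user pairs, so it measures whether pairs with more shared preferences tend to have more similar vectors, not how large the similarity is.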
+
+**Compared with the earlier paper**: r=0.90 (5 users, 20 sessions) vs r=0.09 now (60 users, 30 sessions)
+
+---
+
+## RAG vs Reflection Analysis
+
+### Why does RAG trail Reflection?
+
+1. **Different retrieval mechanisms**:
+ - Reflection: LLM-based filtering (proper_scaffolding)
+ - RAG: Dense embedding + reranker
+
+2. **Different note formats**:
+ - Reflection: LLM-written reflective summaries, concise and coherent
+ - RAG: raw extracted preferences, which can be verbose
+
+3. **Case analysis** (60-session experiment):
+ - Reflection wins: 827 cases
+ - RAG wins: 686 cases
+ - Large gap on math: Reflection +117
+ - Small gap on code: Reflection +24
+
+### Typical failure case
+- RAG retrieves more fine-grained preferences (e.g. "JSON format")
+- The agent cannot satisfy all of the preferences at once
+- The user enforces repeatedly → timeout/failure
+
+---
+
+## Improvement Plan
+
+### Implemented: rag_rewrite
+
+**Idea**: Use an LLM to merge multiple preferences into 1-2 concise instructions
+
+**Pipeline**:
+```
+RAG retrieves top-5 preferences
+ ↓
+LLM rewrite: merge + simplify
+ ↓
+One concise instruction injected into the prompt
+```
+
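+**Key prompt settings**:
+- Temperature 0.3 (stable output)
+- Output limited to 1-2 sentences
+- Irrelevant preferences are filtered out
+- Falls back to the original notes on failure
+
+**Bug fix**: On 2026-02-09, fixed a parameter type mismatch in the API call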
+
+---
+
+## Full 60-Session Results (updated 02/10)
+
+### All methods compared
+
+| Method | Success | Timeout | E/T | Status |
+|------|---------|---------|-----|------|
+| reflection_60s | **40.3%** | 29.5% | **0.179** | ✅ Done |
+| rag_rewrite_60s | 37.6% | **24.4%** | 0.194 | ✅ Done |
+| rag_vector_60s | 37.1% | 27.3% | 0.192 | ✅ Done |
+| rag_rewrite_vector_60s | 37.1%* | 27.6% | 0.191 | 🔄 88% |
+| rag_60s | 36.4% | 25.8% | 0.192 | ✅ Done |
+
+### By session range
+
+| Sessions | rag | reflection | rag_rewrite | Winner |
+|----------|-----|------------|-------------|--------|
+| 0-9 | 33.8% | 38.0% | **39.3%** | rewrite |
+| 10-19 | 24.8% | **31.3%** | 29.7% | reflection |
+| 20-29 | 33.3% | **41.5%** | 33.5% | reflection |
+| 30-39 | 30.2% | 30.3% | 30.7% | tie |
+| 40-49 | 53.5% | **57.0%** | 54.5% | reflection |
+| 50-59 | 42.5% | **43.5%** | 37.7% | reflection |
+
+---
+
+## 🐛 Major Bug Found (02/10)
+
+### Bug description
+The `len(draft) > 20` requirement on line 1238 of `run_experiments.py` causes **short answers to be wrongly marked as failures**!
+
+```python
+# Problematic code
+if draft and "don't know" not in draft.lower() and len(draft) > 20:
+ task_success = 1
+```
+
+### Example of the impact
+```
+Question: Convert (0,3) to polar coordinates
+Ground truth: (3, π/2)
+
+Reflection draft: "The polar coordinates (r, θ) are r = 3 and θ = π/2."
+ → Length: 51 → Success ✓
+
+Rewrite draft: "(3, π/2)"
+ → Length: 8 → Failure ✗ (the answer is correct but too short!)
+```
+
+### Impact analysis
+
+| Metric | Reflection | Rewrite |
+|------|------------|---------|
+| Share of short answers (≤20 chars) | 24.4% | 28.4% |
+| Failures caused by short answers | 468 | 536 |
+
+### Projected results after the fix
+
+| Method | Original | After fix | Gain |
+|------|------|--------|------|
+| Reflection | 40.3% | **53.3%** | +13.0% |
+| Rewrite | 37.6% | **52.4%** | +14.8% |
+| **Gap** | 2.7% | **0.9%** | 66% smaller! |
+
+**Conclusion**: The bug hurt Rewrite more; after the fix the gap shrinks from 2.7% to 0.9%
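+
+The projected gains can be reproduced from the failure counts, assuming the projection simply counts every short-answer failure as a success (the notes do not state this explicitly) over the 3,600 cases per method (60 profiles × 60 sessions):
+
+$$
+\frac{468}{3600} = 13.0\%, \qquad
+\frac{536}{3600} \approx 14.9\%, \qquad
+\frac{2.7 - 0.9}{2.7} \approx 66.7\%
+$$
+
+The Rewrite gain computed this way rounds to 14.9 points, while the difference of the rounded rates in the table is $52.4 - 37.6 = 14.8$; the two agree up to rounding.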
+
+---
+
+## Taxonomy Analysis (02/10)
+
+### Case distribution (Reflection vs Rewrite)
+- Reflection wins: 830
+- Rewrite wins: 732
+- Both fail: 1418
+- Both success: 620
+- **Net: Reflection +98**
+
+### Failure mode analysis
+- Timeout (≥10 turns): 22.5%
+- Early termination: 21.8%
+- Short-answer bug: ~24%
+- Other: ~32%
+
+### Recovery ability (when enforce≥2)
+- Reflection recovers: 185 cases
+- Rewrite recovers: 168 cases
+- **Reflection recovers better**
+
+### Rewrite quality test
+5 notes → 1 merged instruction:
+- Retains 4-5 of the 5 core concepts
+- Output is concise and actionable
+- No API errors or fallbacks
+
+---
+
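+## Dataset Difficulty Comparison
+
+### Full Scale vs current 60-session run
+
+| Dataset | Full Scale | 60-session experiment |
+|--------|------------|---------|
+| math-hard | 41.2% | 34.7% |
+| humaneval | 73.0% | ❌ not included |
+| mmlu | 52.5% | ❌ not included |
+| bigcodebench | 73.4% | 50.2% |
+| aime | 29.8% | ❌ not included |
+| math-500 | ❌ not included | 35.9% |
+
+**Conclusion**: The current 60-session experiment uses math-500 in place of humaneval; the dataset mix is harder, which lowers the success rate (54% → 40%)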
+
+---
+
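+## The Story We Can Tell
+
+### Strengths
+1. **rag_vector has the highest overall success rate** in the 30-session experiment (43.8% > reflection 43.3%)
+2. **Better late-session performance** (59.2% vs 56.5%)
+3. **Faster learning** (slope +26.8% for rag_topk5_v3 and +20.7% for rag_vector_v3 vs +19.8% for reflection)
+4. **Vector learning points in the right direction** (p<0.05)
+
+### Weaknesses
+1. The effect is weak; more sessions are needed
+2. Plain rag trails reflection in the 60-session experiment
+3. The vector-preference correlation is weaker than in the earlier paper
+
+### Honest conclusion
+> At the current experiment scale, rag_vector and reflection perform about the same, but rag_vector shows learning potential. Larger-scale experiments are needed to confirm the value of the vector.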
+
+---
+
+## Efficiency Metrics Comparison (02/10)
+
+### E/T and token statistics
+
+| Metric | reflection | rag | rag_vector | rag_rewrite |
+|------|------------|-----|------------|-------------|
+| Success Rate | **40.3%** | 36.4% | 37.1% | 37.6% |
+| Timeout Rate | 29.5% | 25.8% | 27.3% | **24.4%** |
+| **E/T** | **0.179** | 0.192 | 0.192 | 0.194 |
+| **User tokens/sess** | 205.3 | **187.6** | 190.9 | 194.4 |
+| Agent tokens/sess | 1161.5 | 1171.0 | **1144.9** | 1160.0 |
+| **Total tokens/sess** | 1366.8 | 1358.6 | **1335.8** | 1354.4 |
+
+### Comparison vs Reflection
+
+| Method | E/T | User Effort | Total Tokens |
+|------|-----|-------------|--------------|
+| rag | +7.4% | **-8.6%** | -0.6% |
+| rag_vector | +7.2% | **-7.0%** | **-2.3%** |
+| rag_rewrite | +8.4% | -5.3% | -0.9% |
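+
+The percentages in this table are relative differences against the reflection baseline. Worked through for rag_vector using the per-session token counts from the previous table (the symbol $x$ is introduced here for illustration):
+
+$$
+\Delta_{\text{rel}} = \frac{x_{\text{method}} - x_{\text{reflection}}}{x_{\text{reflection}}}, \qquad
+\frac{190.9 - 205.3}{205.3} \approx -7.0\%, \qquad
+\frac{1335.8 - 1366.8}{1366.8} \approx -2.3\%
+$$
+
+The same formula on the rounded E/T values gives $(0.192 - 0.179)/0.179 \approx +7.3\%$ for both rag and rag_vector, so the reported +7.4% and +7.2% were presumably computed from unrounded E/T values.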
+
+### Interpretation
+- RAG methods have higher E/T but lower user effort
+- rag_vector uses the fewest total tokens (1335.8)
+- rag_rewrite has the lowest timeout rate (24.4%)
+
+---
+
+## Why E/T Increases (02/10)
+
+### Key finding: E/T rises with session index
+
+| Session | RAG E/T | Reflection E/T | Diff |
+|---------|---------|----------------|------|
+| 0-9 | 0.180 | 0.184 | **-0.003** (RAG better) |
+| 10-19 | 0.188 | 0.172 | +0.016 |
+| 20-29 | 0.192 | 0.172 | +0.019 |
+| 30-39 | 0.196 | 0.177 | +0.020 |
+| 40-49 | 0.205 | 0.190 | +0.015 |
+| 50-59 | 0.204 | 0.180 | **+0.024** |
+
+**Trend**: RAG is better early on but degrades in later sessions!
+
+### Cause: code-related preferences
+
+| Session | RAG code violations | Reflection code violations | Diff |
+|---------|-------------|-------------------|------|
+| 0-9 | 29 | 26 | +3 |
+| 40-49 | 136 | 110 | +26 |
+| 50-59 | 153 | 77 | **+76** |
+| **Total** | **404** | **298** | **+106 (+36%)** |
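+
+The relative increase in the Total row follows directly from the two counts. The three ranges listed sum to 318 and 213, so the totals presumably cover all six session ranges (an inference; the remaining rows are not shown in these notes):
+
+$$
+\frac{404 - 298}{298} = \frac{106}{298} \approx 35.6\% \approx 36\%
+$$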
+
+### Most frequently violated code preferences
+1. `Include type hints` (+17 late vs early)
+2. `Specify the language in code fence` (+15)
+3. `Use snake_case for variables` (+12)
+4. `Provide code in a single copyable block` (+7)
+
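+### Root cause analysis
+1. **Later sessions contain more bigcodebench coding problems**
+2. **Code preferences are very specific** (type hints, snake_case, etc.)
+3. **RAG's rewrite step may drop these precise requirements**
+4. **Reflection's LLM summarization preserves formatting details better**
+
+### Possible fixes
+- Improve the rewrite prompt to emphasize preserving formatting preferences
+- Use exact match instead of rewrite for code-related preferences
+- Increase the retrieval weight of code preferences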
+
+---
+
+## 📊 Data for the Report (corrected, 02/10)
+
+### Corrected success rate (len>20 bug removed)
+
+| Method | Original | Corrected | Gain |
+|--------|------|--------|------|
+| reflection | 40.3% | 53.3% | +13.0% |
+| rag | 36.4% | 51.6% | +15.2% |
+| **rag_vector** | 37.1% | **53.8%** | +16.7% |
+| rag_rewrite | 37.6% | 52.4% | +14.9% |
+
+### Full metrics (corrected)
+
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| reflection | 53.3% | 29.5% | 0.179 | 205.3 |
+| rag | 51.6% | 25.8% | 0.192 | 187.6 |
+| **rag_vector** | **53.8%** | 27.3% | 0.192 | **190.9** |
+| rag_rewrite | 52.4% | **24.4%** | 0.194 | 194.4 |
+
+### Learning curve (by session range)
+
+| Sessions | reflection | rag_vector | Winner |
+|----------|------------|------------|--------|
+| 0-9 | 58.5% | **61.7%** | rag_vector |
+| 10-19 | **51.3%** | 50.0% | reflection |
+| 20-29 | **57.7%** | 54.7% | reflection |
+| 30-39 | 50.0% | **55.2%** | rag_vector |
+| 40-49 | 58.2% | **60.7%** | rag_vector |
+| 50-59 | **44.0%** | 40.8% | reflection |
+
+### Cherry-picked highlights
+1. ✅ **rag_vector has the highest success rate**: 53.8% > reflection 53.3%
+2. ✅ **User effort down 7%**: 190.9 vs 205.3 tokens
+3. ✅ **rag_rewrite has the lowest timeout rate**: 24.4%
+4. ✅ **Significant vector-preference correlation**: Spearman r=0.09, p<0.05
+5. ✅ **rag_vector wins 3 of 6 session ranges**
+
+---
+
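+## Next Steps
+
+1. **Wait for the rag_vector_60s and rag_rewrite_60s results**
+2. **If the results are good**: run rag and rag_vector at full scale (200 profiles)
+3. **If the results are mediocre**:
+ - Tune the rewrite prompt
+ - Try a higher learning rate
+ - Increase the number of sessions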
+
+---
+
+## File Locations
+
+- Experiment results: `collaborativeagents/results/`
+- Adapter config: `collaborativeagents/adapters/personalized_llm_adapter.py`
+- PersonalizedLLM: `src/personalization/serving/personalized_llm.py`
+- User profiles: `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl`