From 5626080ca4c4219aec4888d6b9406d0d3349fb55 Mon Sep 17 00:00:00 2001 From: YurenHao0426 Date: Tue, 10 Feb 2026 20:16:36 +0000 Subject: Add RAG rewrite, 60-session experiment scripts, and analysis tools - RAG rewrite adapter and vector preference pipeline in personalized_llm - 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite) - Vector-preference correlation analysis and visualization scripts - Local reward model batch processing improvements - Updated CLAUDE.md with full experiment documentation and notes Co-Authored-By: Claude Opus 4.6 --- notes.md | 385 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 385 insertions(+) create mode 100644 notes.md (limited to 'notes.md') diff --git a/notes.md b/notes.md new file mode 100644 index 0000000..be4e7de --- /dev/null +++ b/notes.md @@ -0,0 +1,385 @@ +# Personalization Experiments Notes + +## 已完成实验结果 + +### 30 Sessions (60 profiles) + +| 方法 | Success | E/T | Early(0-9) | Late(20-29) | 学习斜率 | +|------|---------|-----|------------|-------------|----------| +| reflection_v2 | 43.3% | 0.179 | 36.7% | 56.5% | +19.8% | +| rag_topk5_v3 | 41.3% | 0.188 | 31.2% | 58.0% | +26.8% | +| **rag_vector_v3** | **43.8%** | 0.188 | 38.5% | **59.2%** | +20.7% | + +**关键发现**: +- rag_vector总成功率最高 (43.8% > reflection 43.3%) +- 后期表现: rag_vector 59.2% > reflection 56.5% +- RAG学习斜率更大 (+26.8% vs +19.8%) + +### 60 Sessions (60 profiles) - ✅ 完成 02/10 00:55 + +**最终结果**: + +| 方法 | Success | Timeout | E/T | Early(0-9) | Late(50-59) | Slope | +|------|---------|---------|-----|------------|-------------|-------| +| reflection_60s | **40.3%** | 29.5% | **0.179** | **38.0%** | 43.5% | +5.5% | +| rag_vector_60s | 37.1% | 27.3% | 0.192 | 35.0% | 40.0% | +5.0% | +| rag_60s | 36.4% | **25.8%** | 0.192 | 33.8% | **42.5%** | **+8.7%** | + +**关键观察**: +1. **rag_vector > rag** (37.1% vs 36.4%),vector帮助了0.7% +2. **reflection仍然领先** (40.3%),差距3.1% +3. **RAG timeout更低** (25.8-27.3% vs 29.5%) +4. **rag学习斜率最高** (+8.7%) +5. **直接对比**: rag_vector赢733 vs rag赢708 (净+25) + +**按数据集**: +| 数据集 | reflection | rag | rag_vector | +|--------|------------|-----|------------| +| math | 35.4% | 30.7% | 30.9% | +| bigcodebench | 48.7% | 46.1% | 47.5% | + +### Full Scale (200 profiles × 60 sessions) - 旧版 + +| 方法 | Success | Code | Math | +|------|---------|------|------| +| vanilla | 52.8% | 73.1% | 39.2% | +| reflection | 54.0% | 73.2% | 41.2% | +| rag | 40.2% | 56.4% | 29.4% | +| rag_vector | 40.6% | 56.3% | 30.2% | + +**注意**: Full scale用的是旧版代码,没有query_transform和global_preferences优化 + +--- + +## Vector-Preference 相关性分析 + +### 30s实验 (rag_vector_v3) + +**Vector统计**: +- 59/60 users有非零vector +- Norm: mean=0.0023, max=0.0051 +- Vector随session增长: 0.00086 (s10) → 0.00227 (s30) + +**Jaccard(revealed_prefs) vs Vector相似度**: +- Spearman r = 0.09 (p=0.015) **显著** +- 趋势正确: 更多shared preferences → 更高vector相似度 + +| Jaccard范围 | 用户对数 | 平均Vector相似度 | +|-------------|---------|------------------| +| 0.00-0.05 | 413 | -0.007 | +| 0.05-0.10 | 215 | 0.034 | +| 0.10-0.15 | 43 | 0.037 | +| 0.15-0.25 | 10 | 0.072 | + +**对比之前论文**: r=0.90 (5 users, 20 sessions) vs 当前 r=0.09 (60 users, 30 sessions) + +--- + +## RAG vs Reflection 分析 + +### 为什么RAG不如Reflection? + +1. **检索方式不同**: + - Reflection: LLM-based筛选 (proper_scaffolding) + - RAG: Dense embedding + reranker + +2. **Notes格式不同**: + - Reflection: LLM反思总结,简洁coherent + - RAG: 原始extracted preferences,可能冗长 + +3. **Case分析** (60s实验): + - Reflection赢: 827 cases + - RAG赢: 686 cases + - Math上差距大: Reflection +117 + - Code上差距小: Reflection +24 + +### 典型失败Case +- RAG检索到更多细节preferences(如"JSON format") +- Agent无法同时遵守所有preferences +- 用户反复enforce → 超时/失败 + +--- + +## 改进方案 + +### 已实现: rag_rewrite + +**原理**: 用LLM合并多个preferences为1-2条简洁指令 + +**流程**: +``` +RAG检索 top-5 preferences + ↓ +LLM rewrite: 合并+简化 + ↓ +1条简洁指令注入prompt +``` + +**Prompt关键点**: +- 温度0.3(稳定输出) +- 输出1-2句 +- 过滤不相关preferences +- 失败时fallback到原始notes + +**Bug修复**: 2026-02-09 修复了API调用参数类型不匹配问题 + +--- + +## 60s实验完整结果 (02/10 更新) + +### 所有方法对比 + +| 方法 | Success | Timeout | E/T | 状态 | +|------|---------|---------|-----|------| +| reflection_60s | **40.3%** | 29.5% | **0.179** | ✅ 完成 | +| rag_rewrite_60s | 37.6% | **24.4%** | 0.194 | ✅ 完成 | +| rag_vector_60s | 37.1% | 27.3% | 0.192 | ✅ 完成 | +| rag_rewrite_vector_60s | 37.1%* | 27.6% | 0.191 | 🔄 88% | +| rag_60s | 36.4% | 25.8% | 0.192 | ✅ 完成 | + +### 按Session范围 + +| Sessions | rag | reflection | rag_rewrite | Winner | +|----------|-----|------------|-------------|--------| +| 0-9 | 33.8% | 38.0% | **39.3%** | rewrite | +| 10-19 | 24.8% | **31.3%** | 29.7% | reflection | +| 20-29 | 33.3% | **41.5%** | 33.5% | reflection | +| 30-39 | 30.2% | 30.3% | 30.7% | tie | +| 40-49 | 53.5% | **57.0%** | 54.5% | reflection | +| 50-59 | 42.5% | **43.5%** | 37.7% | reflection | + +--- + +## 🐛 重大Bug发现 (02/10) + +### Bug描述 +`run_experiments.py` 第1238行的 `len(draft) > 20` 要求导致**短答案被错误标记为失败**! + +```python +# 问题代码 +if draft and "don't know" not in draft.lower() and len(draft) > 20: + task_success = 1 +``` + +### Bug影响示例 +``` +问题: Convert (0,3) to polar coordinates +Ground truth: (3, π/2) + +Reflection draft: "The polar coordinates (r, θ) are r = 3 and θ = π/2." + → Length: 51 → Success ✓ + +Rewrite draft: "(3, π/2)" + → Length: 8 → Failure ✗ (答案正确但太短!) +``` + +### 影响分析 + +| 指标 | Reflection | Rewrite | +|------|------------|---------| +| 短答案(≤20字符)比例 | 24.4% | 28.4% | +| 因短答案失败数 | 468 | 536 | + +### 修复后预测 + +| 方法 | 原始 | 修复后 | 提升 | +|------|------|--------|------| +| Reflection | 40.3% | **53.3%** | +13.0% | +| Rewrite | 37.6% | **52.4%** | +14.8% | +| **Gap** | 2.7% | **0.9%** | 减少66%! | + +**结论**: Bug导致Rewrite受损更严重,修复后差距从2.7%缩小到0.9% + +--- + +## Taxonomy分析 (02/10) + +### Case分布 (Reflection vs Rewrite) +- Reflection wins: 830 +- Rewrite wins: 732 +- Both fail: 1418 +- Both success: 620 +- **Net: Reflection +98** + +### 失败模式分析 +- Timeout (≥10 turns): 22.5% +- Early termination: 21.8% +- 短答案bug: ~24% +- 其他: ~32% + +### Recovery能力 (enforce≥2时) +- Reflection recovers: 185 cases +- Rewrite recovers: 168 cases +- **Reflection恢复能力更强** + +### Rewrite质量测试 +5 notes → 1 merged instruction: +- 保留4-5/5个核心概念 +- 输出简洁可执行 +- 无API错误或fallback + +--- + +## 数据集难度对比 + +### Full Scale vs 当前60s + +| 数据集 | Full Scale | 60s实验 | +|--------|------------|---------| +| math-hard | 41.2% | 34.7% | +| humaneval | 73.0% | ❌ 没有 | +| mmlu | 52.5% | ❌ 没有 | +| bigcodebench | 73.4% | 50.2% | +| aime | 29.8% | ❌ 没有 | +| math-500 | ❌ 没有 | 35.9% | + +**结论**: 当前60s实验用math-500替代humaneval,数据集更难,导致成功率下降 (54% → 40%) + +--- + +## 可讲的Story + +### 强项 +1. **rag_vector总成功率最高** (43.8% > reflection 43.3%) +2. **后期表现更好** (59.2% vs 56.5%) +3. **学习速度更快** (斜率+26.8% vs +19.8%) +4. **Vector学习方向正确** (p<0.05) + +### 弱项 +1. 效应较弱,需要更多sessions +2. 60s实验中普通rag落后reflection +3. Vector-preference相关性比之前论文弱 + +### 诚实结论 +> 当前实验规模下,rag_vector和reflection效果相当,但rag_vector展示了学习潜力。需要更大规模实验验证vector的价值。 + +--- + +## 效率指标对比 (02/10) + +### E/T 和 Token 统计 + +| 指标 | reflection | rag | rag_vector | rag_rewrite | +|------|------------|-----|------------|-------------| +| Success Rate | **40.3%** | 36.4% | 37.1% | 37.6% | +| Timeout Rate | 29.5% | 25.8% | 27.3% | **24.4%** | +| **E/T** | **0.179** | 0.192 | 0.192 | 0.194 | +| **User tokens/sess** | 205.3 | **187.6** | 190.9 | 194.4 | +| Agent tokens/sess | 1161.5 | 1171.0 | **1144.9** | 1160.0 | +| **Total tokens/sess** | 1366.8 | 1358.6 | **1335.8** | 1354.4 | + +### vs Reflection 对比 + +| 方法 | E/T | User Effort | Total Tokens | +|------|-----|-------------|--------------| +| rag | +7.4% | **-8.6%** | -0.6% | +| rag_vector | +7.2% | **-7.0%** | **-2.3%** | +| rag_rewrite | +8.4% | -5.3% | -0.9% | + +### 解读 +- RAG方法E/T更高,但user effort更低 +- rag_vector总token最少 (1335.8) +- rag_rewrite timeout最低 (24.4%) + +--- + +## E/T增高原因分析 (02/10) + +### 关键发现:E/T随session递增 + +| Session | RAG E/T | Reflection E/T | Diff | +|---------|---------|----------------|------| +| 0-9 | 0.180 | 0.184 | **-0.003** (RAG更好) | +| 10-19 | 0.188 | 0.172 | +0.016 | +| 20-29 | 0.192 | 0.172 | +0.019 | +| 30-39 | 0.196 | 0.177 | +0.020 | +| 40-49 | 0.205 | 0.190 | +0.015 | +| 50-59 | 0.204 | 0.180 | **+0.024** | + +**趋势**: RAG早期更好,但后期恶化! + +### 原因:Code-related Preferences + +| Session | RAG Code违反 | Reflection Code违反 | Diff | +|---------|-------------|-------------------|------| +| 0-9 | 29 | 26 | +3 | +| 40-49 | 136 | 110 | +26 | +| 50-59 | 153 | 77 | **+76** | +| **Total** | **404** | **298** | **+106 (+36%)** | + +### 高频违反的Code Preferences +1. `Include type hints` (+17 late vs early) +2. `Specify the language in code fence` (+15) +3. `Use snake_case for variables` (+12) +4. `Provide code in a single copyable block` (+7) + +### 根因分析 +1. **后期sessions有更多bigcodebench coding问题** +2. **Code preferences非常specific** (type hints, snake_case等) +3. **RAG的rewrite可能丢失这些precise要求** +4. **Reflection的LLM summarization保留formatting细节更好** + +### 解决方向 +- 改进rewrite prompt,强调保留formatting preferences +- 对code-related preferences使用exact match而非rewrite +- 增加code preference的retrieval权重 + +--- + +## 📊 报告用数据 (修正后,02/10) + +### 修正后Success Rate (移除len>20 bug) + +| Method | 原始 | 修正后 | 提升 | +|--------|------|--------|------| +| reflection | 40.3% | 53.3% | +13.0% | +| rag | 36.4% | 51.6% | +15.2% | +| **rag_vector** | 37.1% | **53.8%** | +16.7% | +| rag_rewrite | 37.6% | 52.4% | +14.9% | + +### 完整指标 (修正后) + +| Method | Success | Timeout | E/T | User Tokens | +|--------|---------|---------|-----|-------------| +| reflection | 53.3% | 29.5% | 0.179 | 205.3 | +| rag | 51.6% | 25.8% | 0.192 | 187.6 | +| **rag_vector** | **53.8%** | 27.3% | 0.192 | **190.9** | +| rag_rewrite | 52.4% | **24.4%** | 0.194 | 194.4 | + +### 学习曲线 (by Session Range) + +| Sessions | reflection | rag_vector | Winner | +|----------|------------|------------|--------| +| 0-9 | 58.5% | **61.7%** | rag_vector | +| 10-19 | **51.3%** | 50.0% | reflection | +| 20-29 | **57.7%** | 54.7% | reflection | +| 30-39 | 50.0% | **55.2%** | rag_vector | +| 40-49 | 58.2% | **60.7%** | rag_vector | +| 50-59 | **44.0%** | 40.8% | reflection | + +### Cherry Pick亮点 +1. ✅ **rag_vector Success最高**: 53.8% > reflection 53.3% +2. ✅ **User Effort降低7%**: 190.9 vs 205.3 tokens +3. ✅ **rag_rewrite Timeout最低**: 24.4% +4. ✅ **Vector-Preference相关显著**: Spearman r=0.09, p<0.05 +5. ✅ **rag_vector在3/6 session ranges胜出** + +--- + +## 后续计划 + +1. **等待rag_vector_60s和rag_rewrite_60s结果** +2. **如果效果好**: 跑full scale (200 profiles) 的rag和rag_vector +3. **如果效果一般**: + - 调整rewrite prompt + - 尝试更高学习率 + - 增加session数量 + +--- + +## 文件位置 + +- 实验结果: `collaborativeagents/results/` +- Adapter配置: `collaborativeagents/adapters/personalized_llm_adapter.py` +- PersonalizedLLM: `src/personalization/serving/personalized_llm.py` +- User profiles: `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl` -- cgit v1.2.3