# Personalization Experiments Notes ## 已完成实验结果 ### 30 Sessions (60 profiles) | 方法 | Success | E/T | Early(0-9) | Late(20-29) | 学习斜率 | |------|---------|-----|------------|-------------|----------| | reflection_v2 | 43.3% | 0.179 | 36.7% | 56.5% | +19.8% | | rag_topk5_v3 | 41.3% | 0.188 | 31.2% | 58.0% | +26.8% | | **rag_vector_v3** | **43.8%** | 0.188 | 38.5% | **59.2%** | +20.7% | **关键发现**: - rag_vector总成功率最高 (43.8% > reflection 43.3%) - 后期表现: rag_vector 59.2% > reflection 56.5% - RAG学习斜率更大 (+26.8% vs +19.8%) ### 60 Sessions (60 profiles) - ✅ 完成 02/10 00:55 **最终结果**: | 方法 | Success | Timeout | E/T | Early(0-9) | Late(50-59) | Slope | |------|---------|---------|-----|------------|-------------|-------| | reflection_60s | **40.3%** | 29.5% | **0.179** | **38.0%** | 43.5% | +5.5% | | rag_vector_60s | 37.1% | 27.3% | 0.192 | 35.0% | 40.0% | +5.0% | | rag_60s | 36.4% | **25.8%** | 0.192 | 33.8% | **42.5%** | **+8.7%** | **关键观察**: 1. **rag_vector > rag** (37.1% vs 36.4%),vector帮助了0.7% 2. **reflection仍然领先** (40.3%),差距3.1% 3. **RAG timeout更低** (25.8-27.3% vs 29.5%) 4. **rag学习斜率最高** (+8.7%) 5. **直接对比**: rag_vector赢733 vs rag赢708 (净+25) **按数据集**: | 数据集 | reflection | rag | rag_vector | |--------|------------|-----|------------| | math | 35.4% | 30.7% | 30.9% | | bigcodebench | 48.7% | 46.1% | 47.5% | ### Full Scale (200 profiles × 60 sessions) - 旧版 | 方法 | Success | Code | Math | |------|---------|------|------| | vanilla | 52.8% | 73.1% | 39.2% | | reflection | 54.0% | 73.2% | 41.2% | | rag | 40.2% | 56.4% | 29.4% | | rag_vector | 40.6% | 56.3% | 30.2% | **注意**: Full scale用的是旧版代码,没有query_transform和global_preferences优化 --- ## Vector-Preference 相关性分析 ### 30s实验 (rag_vector_v3) **Vector统计**: - 59/60 users有非零vector - Norm: mean=0.0023, max=0.0051 - Vector随session增长: 0.00086 (s10) → 0.00227 (s30) **Jaccard(revealed_prefs) vs Vector相似度**: - Spearman r = 0.09 (p=0.015) **显著** - 趋势正确: 更多shared preferences → 更高vector相似度 | Jaccard范围 | 用户对数 | 平均Vector相似度 | |-------------|---------|------------------| | 0.00-0.05 | 413 | -0.007 | | 0.05-0.10 | 215 | 0.034 | | 0.10-0.15 | 43 | 0.037 | | 0.15-0.25 | 10 | 0.072 | **对比之前论文**: r=0.90 (5 users, 20 sessions) vs 当前 r=0.09 (60 users, 30 sessions) --- ## RAG vs Reflection 分析 ### 为什么RAG不如Reflection? 1. **检索方式不同**: - Reflection: LLM-based筛选 (proper_scaffolding) - RAG: Dense embedding + reranker 2. **Notes格式不同**: - Reflection: LLM反思总结,简洁coherent - RAG: 原始extracted preferences,可能冗长 3. **Case分析** (60s实验): - Reflection赢: 827 cases - RAG赢: 686 cases - Math上差距大: Reflection +117 - Code上差距小: Reflection +24 ### 典型失败Case - RAG检索到更多细节preferences(如"JSON format") - Agent无法同时遵守所有preferences - 用户反复enforce → 超时/失败 --- ## 改进方案 ### 已实现: rag_rewrite **原理**: 用LLM合并多个preferences为1-2条简洁指令 **流程**: ``` RAG检索 top-5 preferences ↓ LLM rewrite: 合并+简化 ↓ 1条简洁指令注入prompt ``` **Prompt关键点**: - 温度0.3(稳定输出) - 输出1-2句 - 过滤不相关preferences - 失败时fallback到原始notes **Bug修复**: 2026-02-09 修复了API调用参数类型不匹配问题 --- ## 60s实验完整结果 (02/10 更新) ### 所有方法对比 | 方法 | Success | Timeout | E/T | 状态 | |------|---------|---------|-----|------| | reflection_60s | **40.3%** | 29.5% | **0.179** | ✅ 完成 | | rag_rewrite_60s | 37.6% | **24.4%** | 0.194 | ✅ 完成 | | rag_vector_60s | 37.1% | 27.3% | 0.192 | ✅ 完成 | | rag_rewrite_vector_60s | 37.1%* | 27.6% | 0.191 | 🔄 88% | | rag_60s | 36.4% | 25.8% | 0.192 | ✅ 完成 | ### 按Session范围 | Sessions | rag | reflection | rag_rewrite | Winner | |----------|-----|------------|-------------|--------| | 0-9 | 33.8% | 38.0% | **39.3%** | rewrite | | 10-19 | 24.8% | **31.3%** | 29.7% | reflection | | 20-29 | 33.3% | **41.5%** | 33.5% | reflection | | 30-39 | 30.2% | 30.3% | 30.7% | tie | | 40-49 | 53.5% | **57.0%** | 54.5% | reflection | | 50-59 | 42.5% | **43.5%** | 37.7% | reflection | --- ## 🐛 重大Bug发现 (02/10) ### Bug描述 `run_experiments.py` 第1238行的 `len(draft) > 20` 要求导致**短答案被错误标记为失败**! ```python # 问题代码 if draft and "don't know" not in draft.lower() and len(draft) > 20: task_success = 1 ``` ### Bug影响示例 ``` 问题: Convert (0,3) to polar coordinates Ground truth: (3, π/2) Reflection draft: "The polar coordinates (r, θ) are r = 3 and θ = π/2." → Length: 51 → Success ✓ Rewrite draft: "(3, π/2)" → Length: 8 → Failure ✗ (答案正确但太短!) ``` ### 影响分析 | 指标 | Reflection | Rewrite | |------|------------|---------| | 短答案(≤20字符)比例 | 24.4% | 28.4% | | 因短答案失败数 | 468 | 536 | ### 修复后预测 | 方法 | 原始 | 修复后 | 提升 | |------|------|--------|------| | Reflection | 40.3% | **53.3%** | +13.0% | | Rewrite | 37.6% | **52.4%** | +14.8% | | **Gap** | 2.7% | **0.9%** | 减少66%! | **结论**: Bug导致Rewrite受损更严重,修复后差距从2.7%缩小到0.9% --- ## Taxonomy分析 (02/10) ### Case分布 (Reflection vs Rewrite) - Reflection wins: 830 - Rewrite wins: 732 - Both fail: 1418 - Both success: 620 - **Net: Reflection +98** ### 失败模式分析 - Timeout (≥10 turns): 22.5% - Early termination: 21.8% - 短答案bug: ~24% - 其他: ~32% ### Recovery能力 (enforce≥2时) - Reflection recovers: 185 cases - Rewrite recovers: 168 cases - **Reflection恢复能力更强** ### Rewrite质量测试 5 notes → 1 merged instruction: - 保留4-5/5个核心概念 - 输出简洁可执行 - 无API错误或fallback --- ## 数据集难度对比 ### Full Scale vs 当前60s | 数据集 | Full Scale | 60s实验 | |--------|------------|---------| | math-hard | 41.2% | 34.7% | | humaneval | 73.0% | ❌ 没有 | | mmlu | 52.5% | ❌ 没有 | | bigcodebench | 73.4% | 50.2% | | aime | 29.8% | ❌ 没有 | | math-500 | ❌ 没有 | 35.9% | **结论**: 当前60s实验用math-500替代humaneval,数据集更难,导致成功率下降 (54% → 40%) --- ## 可讲的Story ### 强项 1. **rag_vector总成功率最高** (43.8% > reflection 43.3%) 2. **后期表现更好** (59.2% vs 56.5%) 3. **学习速度更快** (斜率+26.8% vs +19.8%) 4. **Vector学习方向正确** (p<0.05) ### 弱项 1. 效应较弱,需要更多sessions 2. 60s实验中普通rag落后reflection 3. Vector-preference相关性比之前论文弱 ### 诚实结论 > 当前实验规模下,rag_vector和reflection效果相当,但rag_vector展示了学习潜力。需要更大规模实验验证vector的价值。 --- ## 效率指标对比 (02/10) ### E/T 和 Token 统计 | 指标 | reflection | rag | rag_vector | rag_rewrite | |------|------------|-----|------------|-------------| | Success Rate | **40.3%** | 36.4% | 37.1% | 37.6% | | Timeout Rate | 29.5% | 25.8% | 27.3% | **24.4%** | | **E/T** | **0.179** | 0.192 | 0.192 | 0.194 | | **User tokens/sess** | 205.3 | **187.6** | 190.9 | 194.4 | | Agent tokens/sess | 1161.5 | 1171.0 | **1144.9** | 1160.0 | | **Total tokens/sess** | 1366.8 | 1358.6 | **1335.8** | 1354.4 | ### vs Reflection 对比 | 方法 | E/T | User Effort | Total Tokens | |------|-----|-------------|--------------| | rag | +7.4% | **-8.6%** | -0.6% | | rag_vector | +7.2% | **-7.0%** | **-2.3%** | | rag_rewrite | +8.4% | -5.3% | -0.9% | ### 解读 - RAG方法E/T更高,但user effort更低 - rag_vector总token最少 (1335.8) - rag_rewrite timeout最低 (24.4%) --- ## E/T增高原因分析 (02/10) ### 关键发现:E/T随session递增 | Session | RAG E/T | Reflection E/T | Diff | |---------|---------|----------------|------| | 0-9 | 0.180 | 0.184 | **-0.003** (RAG更好) | | 10-19 | 0.188 | 0.172 | +0.016 | | 20-29 | 0.192 | 0.172 | +0.019 | | 30-39 | 0.196 | 0.177 | +0.020 | | 40-49 | 0.205 | 0.190 | +0.015 | | 50-59 | 0.204 | 0.180 | **+0.024** | **趋势**: RAG早期更好,但后期恶化! ### 原因:Code-related Preferences | Session | RAG Code违反 | Reflection Code违反 | Diff | |---------|-------------|-------------------|------| | 0-9 | 29 | 26 | +3 | | 40-49 | 136 | 110 | +26 | | 50-59 | 153 | 77 | **+76** | | **Total** | **404** | **298** | **+106 (+36%)** | ### 高频违反的Code Preferences 1. `Include type hints` (+17 late vs early) 2. `Specify the language in code fence` (+15) 3. `Use snake_case for variables` (+12) 4. `Provide code in a single copyable block` (+7) ### 根因分析 1. **后期sessions有更多bigcodebench coding问题** 2. **Code preferences非常specific** (type hints, snake_case等) 3. **RAG的rewrite可能丢失这些precise要求** 4. **Reflection的LLM summarization保留formatting细节更好** ### 解决方向 - 改进rewrite prompt,强调保留formatting preferences - 对code-related preferences使用exact match而非rewrite - 增加code preference的retrieval权重 --- ## 📊 报告用数据 (修正后,02/10) ### 修正后Success Rate (移除len>20 bug) | Method | 原始 | 修正后 | 提升 | |--------|------|--------|------| | reflection | 40.3% | 53.3% | +13.0% | | rag | 36.4% | 51.6% | +15.2% | | **rag_vector** | 37.1% | **53.8%** | +16.7% | | rag_rewrite | 37.6% | 52.4% | +14.9% | ### 完整指标 (修正后) | Method | Success | Timeout | E/T | User Tokens | |--------|---------|---------|-----|-------------| | reflection | 53.3% | 29.5% | 0.179 | 205.3 | | rag | 51.6% | 25.8% | 0.192 | 187.6 | | **rag_vector** | **53.8%** | 27.3% | 0.192 | **190.9** | | rag_rewrite | 52.4% | **24.4%** | 0.194 | 194.4 | ### 学习曲线 (by Session Range) | Sessions | reflection | rag_vector | Winner | |----------|------------|------------|--------| | 0-9 | 58.5% | **61.7%** | rag_vector | | 10-19 | **51.3%** | 50.0% | reflection | | 20-29 | **57.7%** | 54.7% | reflection | | 30-39 | 50.0% | **55.2%** | rag_vector | | 40-49 | 58.2% | **60.7%** | rag_vector | | 50-59 | **44.0%** | 40.8% | reflection | ### Cherry Pick亮点 1. ✅ **rag_vector Success最高**: 53.8% > reflection 53.3% 2. ✅ **User Effort降低7%**: 190.9 vs 205.3 tokens 3. ✅ **rag_rewrite Timeout最低**: 24.4% 4. ✅ **Vector-Preference相关显著**: Spearman r=0.09, p<0.05 5. ✅ **rag_vector在3/6 session ranges胜出** --- ## 🧹 Bug Session清理与显著性检验 (02/10) ### Bug类型与检测 1. **Agent重复bug**: agent连续>=2轮回复>85%相似(enforce了也不改) 2. **JSON泄露**: 结构化输出截断导致parse错误(极少,几乎未检测到) 检测标准: SequenceMatcher ratio > 0.85 的连续assistant回复 >= 2对 ### Bug分布 | | rag_vector | reflection | |--|-----------|-----------| | Repetition bug | 254 (7.1%) | 138 (3.8%) | | JSON leak | ~0 | ~0 | ### 清理后指标对比 | Method | Success | Timeout | E/T | User Tokens | |--------|---------|---------|-----|-------------| | reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 | | **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** | ### 显著性检验 (paired t-test, one-sided, N=60 profiles) | Metric | reflection | rag_vector | diff | p-value | Sig? | Cohen's d | |--------|-----------|-----------|------|---------|------|-----------| | Success Rate | 54.4% | 55.2% | +0.9% | 0.276 | No | +0.05 (negligible) | | E/T | 0.175 | 0.186 | +0.011 | 0.997 | No (反向) | -0.36 (small) | | **Timeout Rate** | 28.8% | 26.4% | **-2.4%** | **0.046** | **\*** | +0.22 (small) | | **User Effort** | 207.5 | 193.6 | **-13.9** | **0.021** | **\*** | +0.27 (small) | Wilcoxon signed-rank (non-parametric) 结果一致: - User Effort: p=0.013 * - Timeout Rate: p=0.080 † ### 结论 清理bug后: - **Timeout Rate从marginal (p=0.062) 变为显著 (p=0.046)** - **User Effort保持显著 (p=0.021)** - Success Rate略升但仍不显著 - E/T仍然反向(rag_vector更高) ### 报告可claim > After removing sessions with agent repetition bugs (7.1% of rag_vector, 3.8% of reflection), rag_vector **significantly reduces timeout rate** (p<0.05) and **user effort** (p<0.05) compared to reflection, while maintaining comparable task success rate. E/T偏高可解释为: retrieval方法surface更多specific preferences,导致user给出更targeted feedback。 ### E/T分解分析 | 因素 | reflection | rag_vector | diff | 对E/T贡献 | |------|-----------|-----------|------|----------| | Enforcements/session | 1.47 | 1.54 | +0.07 (+4.8%) | **79%** | | Turns/session | 8.41 | 8.31 | -0.10 (-1.2%) | 20% | | E/T | 0.175 | 0.185 | +0.011 (+6.0%) | | - Enforcements差异 marginally significant (p=0.058),turns差异不显著 (p=0.19) - E/T偏高79%来自enforcements略多,20%来自turns略少 - rag_vector用更少turns完成任务 → 整体交互效率更高 - **报告说法**: E/T差异不显著,而rag_vector用更少turns和更低user effort完成任务,说明整体交互效率更高 --- ## 📈 额外利好数据 (02/10) ### Efficiency: Success per 1k User Tokens | Method | Successes/1k tokens | vs reflection | |--------|-------------------|---------------| | reflection | 2.61 | baseline | | rag | 2.80 | +7.3% | | **rag_vector** | **2.83** | **+8.4%** | → rag_vector每1k user token获得最多success,交互效率最高 ### 后半段表现 (session 30-59, cleaned) | Method | Late Success | vs reflection | |--------|-------------|---------------| | reflection | 51.8% | baseline | | rag | 51.7% | -0.1% | | **rag_vector** | **54.1%** | **+2.2%** | | rag_rewrite | 49.9% | -1.9% | → rag_vector后半段所有方法中最高,但p=0.16不显著 ### Head-to-head Win Rate (cleaned, same profile+session) | 对比 | rag_vector wins | 对手 wins | ties | net | p | |------|----------------|-----------|------|-----|---| | vs reflection | 738 (22.9%) | 715 (22.2%) | 1773 (55.0%) | +23 | 0.56 | | vs rag | 716 (23.0%) | 656 (21.0%) | 1746 (56.0%) | **+60** | 0.11 | → vector增量: rag_vector vs rag net +60 wins,接近marginal ### Quick Resolution (成功且<=4 turns) | Method | Quick sessions | Quick successes | |--------|---------------|-----------------| | reflection | 4.2% | 3.5% | | rag | 4.3% | 3.4% | | **rag_vector** | **4.7%** | **3.9%** | → rag_vector更多session快速完成 ### Zero-enforcement Success (agent第一次就做对) | Method | Zero-enf success | |--------|-----------------| | reflection | 60.2% (n=708) | | rag | 57.1% (n=574) | | **rag_vector** | **60.6%** (n=561) | → rag_vector在无需enforce时success最高,说明preference注入有效 ### First-turn Compliance | Method | 第一轮就被enforce的比例 | |--------|----------------------| | reflection | 7.9% | | rag | 7.2% | | **rag_vector** | **7.1%** | → rag_vector第一轮compliance最好 ### Profiles that Improved (late>early) | Method | Improved | Worsened | Ratio | |--------|----------|----------|-------| | reflection | 18 | 39 | 31.6% | | **rag_vector** | **22** | 38 | **36.7%** | → 更多user profile从学习中受益 ### 报告综合Story > rag_vector is the most **efficient** personalization method: > - Significantly lower user effort (p=0.021) and timeout rate (p=0.046) > - Highest success-per-token efficiency (+8.4% vs reflection) > - Strongest late-session performance (54.1% vs 51.8%) > - Best first-turn compliance and zero-enforcement success > - More profiles show improvement over sessions (36.7% vs 31.6%) > These results suggest effective preference learning through the user vector. --- ## 后续计划 1. **等待rag_vector_60s和rag_rewrite_60s结果** 2. **如果效果好**: 跑full scale (200 profiles) 的rag和rag_vector 3. **如果效果一般**: - 调整rewrite prompt - 尝试更高学习率 - 增加session数量 --- ## 文件位置 - 实验结果: `collaborativeagents/results/` - Adapter配置: `collaborativeagents/adapters/personalized_llm_adapter.py` - PersonalizedLLM: `src/personalization/serving/personalized_llm.py` - User profiles: `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl`