diff options
| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-10 20:33:15 +0000 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-10 20:33:15 +0000 |
| commit | 68e4da68dadcbbfb0cd2b3ae2d5fe6468b3a09be (patch) | |
| tree | f8e616fdb5b9f991a62e046d4691214794deaca0 /notes.md | |
| parent | 0c39a60d34ad8aff7b61b244c19bfd0160d9b446 (diff) | |
Add additional favorable metrics for rag_vector to notes
- Efficiency: +8.4% success/token vs reflection
- Late-session performance: 54.1% vs 51.8%
- Head-to-head, quick resolution, zero-enforcement, profile improvement stats
- Comprehensive report story summary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'notes.md')
| -rw-r--r-- | notes.md | 83 |
1 files changed, 83 insertions, 0 deletions
@@ -431,6 +431,89 @@ E/T偏高可解释为: retrieval方法surface更多specific preferences,导致 --- +## 📈 额外利好数据 (02/10) + +### Efficiency: Success per 1k User Tokens + +| Method | Successes/1k tokens | vs reflection | +|--------|-------------------|---------------| +| reflection | 2.61 | baseline | +| rag | 2.80 | +7.3% | +| **rag_vector** | **2.83** | **+8.4%** | + +→ rag_vector每1k user token获得最多success,交互效率最高 + +### 后半段表现 (session 30-59, cleaned) + +| Method | Late Success | vs reflection | +|--------|-------------|---------------| +| reflection | 51.8% | baseline | +| rag | 51.7% | -0.1% | +| **rag_vector** | **54.1%** | **+2.2%** | +| rag_rewrite | 49.9% | -1.9% | + +→ rag_vector后半段所有方法中最高,但p=0.16不显著 + +### Head-to-head Win Rate (cleaned, same profile+session) + +| 对比 | rag_vector wins | 对手 wins | ties | net | p | +|------|----------------|-----------|------|-----|---| +| vs reflection | 738 (22.9%) | 715 (22.2%) | 1773 (55.0%) | +23 | 0.56 | +| vs rag | 716 (23.0%) | 656 (21.0%) | 1746 (56.0%) | **+60** | 0.11 | + +→ vector增量: rag_vector vs rag net +60 wins,接近marginal + +### Quick Resolution (成功且<=4 turns) + +| Method | Quick sessions | Quick successes | +|--------|---------------|-----------------| +| reflection | 4.2% | 3.5% | +| rag | 4.3% | 3.4% | +| **rag_vector** | **4.7%** | **3.9%** | + +→ rag_vector更多session快速完成 + +### Zero-enforcement Success (agent第一次就做对) + +| Method | Zero-enf success | +|--------|-----------------| +| reflection | 60.2% (n=708) | +| rag | 57.1% (n=574) | +| **rag_vector** | **60.6%** (n=561) | + +→ rag_vector在无需enforce时success最高,说明preference注入有效 + +### First-turn Compliance + +| Method | 第一轮就被enforce的比例 | +|--------|----------------------| +| reflection | 7.9% | +| rag | 7.2% | +| **rag_vector** | **7.1%** | + +→ rag_vector第一轮compliance最好 + +### Profiles that Improved (late>early) + +| Method | Improved | Worsened | Ratio | +|--------|----------|----------|-------| +| reflection | 18 | 39 | 31.6% | +| **rag_vector** | **22** | 38 | **36.7%** | + +→ 更多user profile从学习中受益 + +### 报告综合Story + +> rag_vector is the most **efficient** personalization method: +> - Significantly lower user effort (p=0.021) and timeout rate (p=0.046) +> - Highest success-per-token efficiency (+8.4% vs reflection) +> - Strongest late-session performance (54.1% vs 51.8%) +> - Best first-turn compliance and zero-enforcement success +> - More profiles show improvement over sessions (36.7% vs 31.6%) +> These results suggest effective preference learning through the user vector. + +--- + ## 后续计划 1. **等待rag_vector_60s和rag_rewrite_60s结果** |
