# Personalization Experiments Notes

## Completed Experiment Results

### 30 Sessions (60 profiles)

| Method | Success | E/T | Early (0-9) | Late (20-29) | Learning Slope |
|------|---------|-----|------------|-------------|----------|
| reflection_v2 | 43.3% | 0.179 | 36.7% | 56.5% | +19.8% |
| rag_topk5_v3 | 41.3% | 0.188 | 31.2% | 58.0% | +26.8% |
| **rag_vector_v3** | **43.8%** | 0.188 | 38.5% | **59.2%** | +20.7% |

**Key findings**:
- rag_vector has the highest overall success rate (43.8% > reflection 43.3%)
- Late-session performance: rag_vector 59.2% > reflection 56.5%
- RAG has the steeper learning slope (+26.8% vs +19.8%)

### 60 Sessions (60 profiles) - ✅ Completed 02/10 00:55

**Final results**:

| Method | Success | Timeout | E/T | Early (0-9) | Late (50-59) | Slope |
|------|---------|---------|-----|------------|-------------|-------|
| reflection_60s | **40.3%** | 29.5% | **0.179** | **38.0%** | 43.5% | +5.5% |
| rag_vector_60s | 37.1% | 27.3% | 0.192 | 35.0% | 40.0% | +5.0% |
| rag_60s | 36.4% | **25.8%** | 0.192 | 33.8% | **42.5%** | **+8.7%** |

**Key observations**:
1. **rag_vector > rag** (37.1% vs 36.4%); the vector helped by 0.7%
2. **reflection still leads** (40.3%), a 3.1% gap
3. **RAG times out less** (25.8-27.3% vs 29.5%)
4. **rag has the steepest learning slope** (+8.7%)
5. **Head-to-head**: rag_vector wins 733 vs rag wins 708 (net +25)

**By dataset**:
| Dataset | reflection | rag | rag_vector |
|--------|------------|-----|------------|
| math | 35.4% | 30.7% | 30.9% |
| bigcodebench | 48.7% | 46.1% | 47.5% |

### Full Scale (200 profiles × 60 sessions) - Old Version

| Method | Success | Code | Math |
|------|---------|------|------|
| vanilla | 52.8% | 73.1% | 39.2% |
| reflection | 54.0% | 73.2% | 41.2% |
| rag | 40.2% | 56.4% | 29.4% |
| rag_vector | 40.6% | 56.3% | 30.2% |

**Note**: the full-scale run used the old code, without the query_transform and global_preferences optimizations

---

## Vector-Preference Correlation Analysis

### 30s Experiment (rag_vector_v3)

**Vector statistics**:
- 59/60 users have a nonzero vector
- Norm: mean = 0.0023, max = 0.0051
- Vectors grow over sessions: 0.00086 (s10) → 0.00227 (s30)

**Jaccard(revealed_prefs) vs vector similarity**:
- Spearman r = 0.09 (p = 0.015), **significant**
- Trend is in the right direction: more shared preferences → higher vector similarity

| Jaccard Range | User Pairs | Mean Vector Similarity |
|-------------|---------|------------------|
| 0.00-0.05 | 413 | -0.007 |
| 0.05-0.10 | 215 | 0.034 |
| 0.10-0.15 | 43 | 0.037 |
| 0.15-0.25 | 10 | 0.072 |

**Compared to the earlier paper**: r = 0.90 (5 users, 20 sessions) vs current r = 0.09 (60 users, 30 sessions)
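
A minimal sketch of this pairwise analysis (a reconstruction, not the actual analysis script; `prefs` maps user → revealed-preference set and `vectors` maps user → learned vector are assumed data layouts):

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two revealed-preference sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity of two user vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pref_vector_correlation(prefs: dict, vectors: dict):
    """Spearman r between pairwise Jaccard(prefs) and vector cosine similarity,
    computed over all user pairs (as in the table above)."""
    users = [u for u in prefs if u in vectors]
    jac, cos = [], []
    for u, v in combinations(users, 2):
        jac.append(jaccard(prefs[u], prefs[v]))
        cos.append(cosine(vectors[u], vectors[v]))
    r, p = spearmanr(jac, cos)
    return r, p
```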

---

## RAG vs Reflection Analysis

### Why does RAG underperform Reflection?

1. **Different retrieval styles**:
   - Reflection: LLM-based filtering (proper_scaffolding)
   - RAG: dense embedding + reranker

2. **Different note formats**:
   - Reflection: LLM-written reflective summaries, concise and coherent
   - RAG: raw extracted preferences, potentially verbose

3. **Case analysis** (60s experiment):
   - Reflection wins: 827 cases
   - RAG wins: 686 cases
   - Gap is large on math: Reflection +117
   - Gap is small on code: Reflection +24

### Typical Failure Case
- RAG retrieves more fine-grained preferences (e.g. "JSON format")
- The agent cannot satisfy all of them at once
- The user keeps enforcing → timeout/failure

---

## Improvement Plans

### Implemented: rag_rewrite

**Idea**: use an LLM to merge multiple preferences into 1-2 concise instructions

**Pipeline**:
```
RAG retrieves top-5 preferences
    ↓
LLM rewrite: merge + simplify
    ↓
1 concise instruction injected into the prompt
```

**Prompt key points**:
- Temperature 0.3 (stable output)
- Output 1-2 sentences
- Filter out irrelevant preferences
- Fall back to the raw notes on failure
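
A minimal sketch of this step (hypothetical names throughout; `llm_complete` and `REWRITE_PROMPT` stand in for whatever completion wrapper and prompt the adapter actually uses):

```python
# Hypothetical prompt; the real one lives in the adapter code.
REWRITE_PROMPT = (
    "Merge the user preferences below into 1-2 concise instructions. "
    "Drop preferences irrelevant to the current task.\n\nPreferences:\n{prefs}"
)


def rewrite_preferences(retrieved_notes: list[str], llm_complete) -> str:
    """Collapse the top-k retrieved preferences into one short instruction,
    falling back to the raw notes if the LLM call fails."""
    prompt = REWRITE_PROMPT.format(
        prefs="\n".join(f"- {n}" for n in retrieved_notes)
    )
    try:
        # temperature 0.3 for stable, near-deterministic rewrites
        return llm_complete(prompt, temperature=0.3).strip()
    except Exception:
        # fallback: inject the original notes unchanged
        return "\n".join(retrieved_notes)
```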

**Bug fix**: 2026-02-09 fixed a parameter-type mismatch in the API call

---

## Full 60s Experiment Results (updated 02/10)

### All Methods Compared

| Method | Success | Timeout | E/T | Status |
|------|---------|---------|-----|------|
| reflection_60s | **40.3%** | 29.5% | **0.179** | ✅ done |
| rag_rewrite_60s | 37.6% | **24.4%** | 0.194 | ✅ done |
| rag_vector_60s | 37.1% | 27.3% | 0.192 | ✅ done |
| rag_rewrite_vector_60s | 37.1%* | 27.6% | 0.191 | 🔄 88% |
| rag_60s | 36.4% | 25.8% | 0.192 | ✅ done |

### By Session Range

| Sessions | rag | reflection | rag_rewrite | Winner |
|----------|-----|------------|-------------|--------|
| 0-9 | 33.8% | 38.0% | **39.3%** | rewrite |
| 10-19 | 24.8% | **31.3%** | 29.7% | reflection |
| 20-29 | 33.3% | **41.5%** | 33.5% | reflection |
| 30-39 | 30.2% | 30.3% | 30.7% | tie |
| 40-49 | 53.5% | **57.0%** | 54.5% | reflection |
| 50-59 | 42.5% | **43.5%** | 37.7% | reflection |

---

## 🐛 Major Bug Found (02/10)

### Bug Description
The `len(draft) > 20` check at line 1238 of `run_experiments.py` causes **short answers to be wrongly marked as failures**!

```python
# problematic code
if draft and "don't know" not in draft.lower() and len(draft) > 20:
    task_success = 1
```

### Bug Impact Example
```
Question: Convert (0,3) to polar coordinates
Ground truth: (3, π/2)

Reflection draft: "The polar coordinates (r, θ) are r = 3 and θ = π/2."
  → Length: 51 → Success ✓

Rewrite draft: "(3, π/2)"
  → Length: 8 → Failure ✗ (correct answer, but too short!)
```

### 影响分析

| 指标 | Reflection | Rewrite |
|------|------------|---------|
| 短答案(≤20字符)比例 | 24.4% | 28.4% |
| 因短答案失败数 | 468 | 536 |

### Predicted Post-Fix Results

| Method | Original | Fixed | Gain |
|------|------|--------|------|
| Reflection | 40.3% | **53.3%** | +13.0% |
| Rewrite | 37.6% | **52.4%** | +14.8% |
| **Gap** | 2.7% | **0.9%** | down 66%! |

**Conclusion**: the bug hurt Rewrite more; after the fix the gap shrinks from 2.7% to 0.9%
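
One way the check could look without the length gate (a sketch, not the actual patch; the normalized-containment check is a stand-in for whatever grading the real evaluator uses):

```python
def task_success(draft: str, ground_truth: str) -> int:
    """Success check without the len(draft) > 20 gate: a short but correct
    answer like "(3, π/2)" now counts. Whitespace-insensitive containment
    is a simplification of the real grader."""
    if not draft or "don't know" in draft.lower():
        return 0
    norm = lambda s: "".join(s.split()).lower()
    return int(norm(ground_truth) in norm(draft))
```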

---

## Taxonomy Analysis (02/10)

### Case Distribution (Reflection vs Rewrite)
- Reflection wins: 830
- Rewrite wins: 732
- Both fail: 1418
- Both success: 620
- **Net: Reflection +98**

### Failure Mode Analysis
- Timeout (≥10 turns): 22.5%
- Early termination: 21.8%
- Short-answer bug: ~24%
- Other: ~32%

### Recovery Ability (when enforce ≥ 2)
- Reflection recovers: 185 cases
- Rewrite recovers: 168 cases
- **Reflection recovers better**

### Rewrite Quality Test
5 notes → 1 merged instruction:
- Retains 4-5 of 5 core concepts
- Output is concise and actionable
- No API errors or fallbacks

---

## Dataset Difficulty Comparison

### Full Scale vs Current 60s

| Dataset | Full Scale | 60s Experiment |
|--------|------------|---------|
| math-hard | 41.2% | 34.7% |
| humaneval | 73.0% | ❌ not run |
| mmlu | 52.5% | ❌ not run |
| bigcodebench | 73.4% | 50.2% |
| aime | 29.8% | ❌ not run |
| math-500 | ❌ not run | 35.9% |

**Conclusion**: the current 60s experiment replaces humaneval with math-500; the harder dataset mix drives the success-rate drop (54% → 40%)

---

## Story We Can Tell

### Strengths
1. **rag_vector has the highest overall success rate** (43.8% > reflection 43.3%)
2. **Better late-session performance** (59.2% vs 56.5%)
3. **Faster learning** (slope +26.8% vs +19.8%)
4. **The vector learns in the right direction** (p < 0.05)

### Weaknesses
1. The effect is weak; more sessions are needed
2. Plain rag trails reflection in the 60s experiment
3. The vector-preference correlation is weaker than in the earlier paper

### Honest Conclusion
> At the current experiment scale, rag_vector and reflection perform comparably, but rag_vector shows learning potential. Larger-scale experiments are needed to validate the value of the vector.

---

## Efficiency Metrics Comparison (02/10)

### E/T and Token Statistics

| Metric | reflection | rag | rag_vector | rag_rewrite |
|------|------------|-----|------------|-------------|
| Success Rate | **40.3%** | 36.4% | 37.1% | 37.6% |
| Timeout Rate | 29.5% | 25.8% | 27.3% | **24.4%** |
| **E/T** | **0.179** | 0.192 | 0.192 | 0.194 |
| **User tokens/sess** | 205.3 | **187.6** | 190.9 | 194.4 |
| Agent tokens/sess | 1161.5 | 1171.0 | **1144.9** | 1160.0 |
| **Total tokens/sess** | 1366.8 | 1358.6 | **1335.8** | 1354.4 |

### Relative to Reflection

| 方法 | E/T | User Effort | Total Tokens |
|------|-----|-------------|--------------|
| rag | +7.4% | **-8.6%** | -0.6% |
| rag_vector | +7.2% | **-7.0%** | **-2.3%** |
| rag_rewrite | +8.4% | -5.3% | -0.9% |

### Interpretation
- RAG methods have higher E/T but lower user effort
- rag_vector uses the fewest total tokens (1335.8)
- rag_rewrite has the lowest timeout rate (24.4%)

---

## Why E/T Rises (02/10)

### Key finding: E/T increases over sessions

| Session | RAG E/T | Reflection E/T | Diff |
|---------|---------|----------------|------|
| 0-9 | 0.180 | 0.184 | **-0.003** (RAG better) |
| 10-19 | 0.188 | 0.172 | +0.016 |
| 20-29 | 0.192 | 0.172 | +0.019 |
| 30-39 | 0.196 | 0.177 | +0.020 |
| 40-49 | 0.205 | 0.190 | +0.015 |
| 50-59 | 0.204 | 0.180 | **+0.024** |

**Trend**: RAG starts better but degrades in later sessions!

### Cause: Code-related Preferences

| Session | RAG Code Violations | Reflection Code Violations | Diff |
|---------|-------------|-------------------|------|
| 0-9 | 29 | 26 | +3 |
| 40-49 | 136 | 110 | +26 |
| 50-59 | 153 | 77 | **+76** |
| **Total** | **404** | **298** | **+106 (+36%)** |

### Most-Violated Code Preferences
1. `Include type hints` (+17 late vs early)
2. `Specify the language in code fence` (+15)
3. `Use snake_case for variables` (+12)
4. `Provide code in a single copyable block` (+7)

### Root Cause Analysis
1. **Later sessions contain more bigcodebench coding problems**
2. **Code preferences are very specific** (type hints, snake_case, etc.)
3. **RAG's rewrite may drop these precise requirements**
4. **Reflection's LLM summarization preserves formatting details better**

### Possible Fixes
- Improve the rewrite prompt to stress preserving formatting preferences
- Use exact match instead of rewrite for code-related preferences
- Increase the retrieval weight of code preferences
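
The last fix could look something like this (illustrative sketch only; the keyword list and boost factor are guesses, not values from the codebase):

```python
# Hypothetical keyword list for spotting code-formatting preferences.
CODE_KEYWORDS = ("type hint", "snake_case", "code fence",
                 "copyable block", "docstring")


def boost_code_prefs(scored_notes: list[tuple[str, float]],
                     boost: float = 1.5) -> list[tuple[str, float]]:
    """Multiply retrieval scores of code-formatting preferences by `boost`,
    then re-sort so they are more likely to survive the top-k cut."""
    rescored = [
        (note,
         score * boost if any(k in note.lower() for k in CODE_KEYWORDS)
         else score)
        for note, score in scored_notes
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```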

---

## 📊 Data for the Report (corrected, 02/10)

### Corrected Success Rate (len>20 bug removed)

| Method | Original | Corrected | Gain |
|--------|------|--------|------|
| reflection | 40.3% | 53.3% | +13.0% |
| rag | 36.4% | 51.6% | +15.2% |
| **rag_vector** | 37.1% | **53.8%** | +16.7% |
| rag_rewrite | 37.6% | 52.4% | +14.9% |

### Full Metrics (corrected)

| Method | Success | Timeout | E/T | User Tokens |
|--------|---------|---------|-----|-------------|
| reflection | 53.3% | 29.5% | 0.179 | 205.3 |
| rag | 51.6% | 25.8% | 0.192 | 187.6 |
| **rag_vector** | **53.8%** | 27.3% | 0.192 | **190.9** |
| rag_rewrite | 52.4% | **24.4%** | 0.194 | 194.4 |

### Learning Curves (by Session Range)

| Sessions | reflection | rag_vector | Winner |
|----------|------------|------------|--------|
| 0-9 | 58.5% | **61.7%** | rag_vector |
| 10-19 | **51.3%** | 50.0% | reflection |
| 20-29 | **57.7%** | 54.7% | reflection |
| 30-39 | 50.0% | **55.2%** | rag_vector |
| 40-49 | 58.2% | **60.7%** | rag_vector |
| 50-59 | **44.0%** | 40.8% | reflection |

### Cherry-Picked Highlights
1. ✅ **Highest success rate for rag_vector**: 53.8% > reflection 53.3%
2. ✅ **User effort down 7%**: 190.9 vs 205.3 tokens
3. ✅ **Lowest timeout for rag_rewrite**: 24.4%
4. ✅ **Significant vector-preference correlation**: Spearman r = 0.09, p < 0.05
5. ✅ **rag_vector wins 3 of 6 session ranges**

---

## 🧹 Bug-Session Cleanup and Significance Testing (02/10)

### Bug Types and Detection

1. **Agent repetition bug**: the agent gives >85%-similar replies for >=2 consecutive turns (and does not change even when the user enforces)
2. **JSON leak**: truncated structured output causes parse errors (rare; almost never detected)

Detection criterion: >= 2 consecutive pairs of assistant replies with SequenceMatcher ratio > 0.85
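
A minimal sketch of this detector (assuming each session's assistant replies are available as a list of strings; thresholds match the criterion above):

```python
from difflib import SequenceMatcher


def has_repetition_bug(assistant_turns: list[str],
                       threshold: float = 0.85,
                       min_pairs: int = 2) -> bool:
    """Flag a session if >= min_pairs consecutive assistant replies are
    more than `threshold` similar (agent repeats itself despite enforcement)."""
    similar_pairs = sum(
        SequenceMatcher(None, a, b).ratio() > threshold
        for a, b in zip(assistant_turns, assistant_turns[1:])
    )
    return similar_pairs >= min_pairs
```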

### Bug Distribution

| | rag_vector | reflection |
|--|-----------|-----------|
| Repetition bug | 254 (7.1%) | 138 (3.8%) |
| JSON leak | ~0 | ~0 |

### Metrics After Cleanup

| Method | Success | Timeout | E/T | User Tokens |
|--------|---------|---------|-----|-------------|
| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |

### Significance Tests (paired t-test, one-sided, N=60 profiles)

| Metric | reflection | rag_vector | diff | p-value | Sig? | Cohen's d |
|--------|-----------|-----------|------|---------|------|-----------|
| Success Rate | 54.4% | 55.2% | +0.9% | 0.276 | No | +0.05 (negligible) |
| E/T | 0.175 | 0.186 | +0.011 | 0.997 | No (reversed) | -0.36 (small) |
| **Timeout Rate** | 28.8% | 26.4% | **-2.4%** | **0.046** | **\*** | +0.22 (small) |
| **User Effort** | 207.5 | 193.6 | **-13.9** | **0.021** | **\*** | +0.27 (small) |

Wilcoxon signed-rank (non-parametric) results agree:
- User Effort: p=0.013 *
- Timeout Rate: p=0.080 †
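
The paired test can be sketched like this (assuming per-profile metric means are paired across the 60 shared profiles; uses `scipy.stats.ttest_rel` with `alternative="less"` for lower-is-better metrics like timeout rate and user effort; sign conventions for Cohen's d may differ from the table above):

```python
import numpy as np
from scipy import stats


def paired_onesided(reflection: np.ndarray, rag_vector: np.ndarray):
    """One-sided paired t-test that rag_vector is LOWER than reflection,
    plus Cohen's d for paired samples. Inputs are per-profile means,
    aligned so index i is the same profile in both arrays."""
    t, p = stats.ttest_rel(rag_vector, reflection, alternative="less")
    diff = rag_vector - reflection
    d = diff.mean() / diff.std(ddof=1)  # negative d = rag_vector lower
    return p, d
```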

### Conclusion

After cleaning out bug sessions:
- **Timeout rate goes from marginal (p=0.062) to significant (p=0.046)**
- **User effort stays significant (p=0.021)**
- Success rate rises slightly but remains non-significant
- E/T is still reversed (rag_vector is higher)

### Claimable in the Report

> After removing sessions with agent repetition bugs (7.1% of rag_vector, 3.8% of reflection), rag_vector **significantly reduces timeout rate** (p<0.05) and **user effort** (p<0.05) compared to reflection, while maintaining comparable task success rate.

The higher E/T can be explained as follows: retrieval-based methods surface more specific preferences, leading the user to give more targeted feedback.

---

## Next Steps

1. **Wait for the rag_vector_60s and rag_rewrite_60s results**
2. **If results are good**: run full scale (200 profiles) for rag and rag_vector
3. **If results are mediocre**:
   - Adjust the rewrite prompt
   - Try a higher learning rate
   - Increase the number of sessions

---

## File Locations

- Experiment results: `collaborativeagents/results/`
- Adapter config: `collaborativeagents/adapters/personalized_llm_adapter.py`
- PersonalizedLLM: `src/personalization/serving/personalized_llm.py`
- User profiles: `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl`