summaryrefslogtreecommitdiff
path: root/notes.md
diff options
context:
space:
mode:
Diffstat (limited to 'notes.md')
-rw-r--r--notes.md52
1 files changed, 52 insertions, 0 deletions
diff --git a/notes.md b/notes.md
index be4e7de..8560152 100644
--- a/notes.md
+++ b/notes.md
@@ -366,6 +366,58 @@ Rewrite draft: "(3, π/2)"
---
+## 🧹 Bug Session清理与显著性检验 (02/10)
+
+### Bug类型与检测
+
+1. **Agent重复bug**: agent连续>=2轮回复>85%相似(enforce了也不改)
+2. **JSON泄露**: 结构化输出截断导致parse错误(极少,几乎未检测到)
+
+检测标准: SequenceMatcher ratio > 0.85 的连续assistant回复 >= 2对
+
+### Bug分布
+
+| | rag_vector | reflection |
+|--|-----------|-----------|
+| Repetition bug | 254 (7.1%) | 138 (3.8%) |
+| JSON leak | ~0 | ~0 |
+
+### 清理后指标对比
+
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
+| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |
+
+### 显著性检验 (paired t-test, one-sided, N=60 profiles)
+
+| Metric | reflection | rag_vector | diff | p-value | Sig? | Cohen's d |
+|--------|-----------|-----------|------|---------|------|-----------|
+| Success Rate | 54.4% | 55.2% | +0.9% | 0.276 | No | +0.05 (negligible) |
+| E/T | 0.175 | 0.186 | +0.011 | 0.997 | No (反向) | -0.36 (small) |
+| **Timeout Rate** | 28.8% | 26.4% | **-2.4%** | **0.046** | **\*** | +0.22 (small) |
+| **User Effort** | 207.5 | 193.6 | **-13.9** | **0.021** | **\*** | +0.27 (small) |
+
+Wilcoxon signed-rank (non-parametric) 结果一致:
+- User Effort: p=0.013 *
+- Timeout Rate: p=0.080 †
+
+### 结论
+
+清理bug后:
+- **Timeout Rate从marginal (p=0.062) 变为显著 (p=0.046)**
+- **User Effort保持显著 (p=0.021)**
+- Success Rate略升但仍不显著
+- E/T仍然反向(rag_vector更高)
+
+### 报告可claim
+
+> After removing sessions with agent repetition bugs (7.1% of rag_vector, 3.8% of reflection), rag_vector **significantly reduces timeout rate** (p<0.05) and **user effort** (p<0.05) compared to reflection, while maintaining comparable task success rate.
+
+E/T偏高可解释为: retrieval方法surface更多specific preferences,导致user给出更targeted feedback。
+
+---
+
## 后续计划
1. **等待rag_vector_60s和rag_rewrite_60s结果**