Diffstat (limited to 'notes.md')
| -rw-r--r-- | notes.md | 52 |
1 files changed, 52 insertions, 0 deletions
@@ -366,6 +366,58 @@ Rewrite draft: "(3, π/2)"

 ---

+## 🧹 Bug Session Cleanup and Significance Testing (02/10)
+
+### Bug Types and Detection
+
+1. **Agent repetition bug**: the agent produces >= 2 consecutive replies that are > 85% similar (persists even after enforcement)
+2. **JSON leak**: truncated structured output causes parse errors (very rare, almost never detected)
+
+Detection criterion: >= 2 pairs of consecutive assistant replies with SequenceMatcher ratio > 0.85
+
+### Bug Distribution
+
+| | rag_vector | reflection |
+|--|-----------|-----------|
+| Repetition bug | 254 (7.1%) | 138 (3.8%) |
+| JSON leak | ~0 | ~0 |
+
+### Metrics After Cleanup
+
+| Method | Success | Timeout | E/T | User Tokens |
+|--------|---------|---------|-----|-------------|
+| reflection (cleaned) | 54.4% | 28.8% | 0.175 | 207.5 |
+| **rag_vector (cleaned)** | **55.2%** | **26.4%** | 0.186 | **193.6** |
+
+### Significance Tests (paired t-test, one-sided, N=60 profiles)
+
+| Metric | reflection | rag_vector | diff | p-value | Sig? | Cohen's d |
+|--------|-----------|-----------|------|---------|------|-----------|
+| Success Rate | 54.4% | 55.2% | +0.9% | 0.276 | No | +0.05 (negligible) |
+| E/T | 0.175 | 0.186 | +0.011 | 0.997 | No (opposite direction) | -0.36 (small) |
+| **Timeout Rate** | 28.8% | 26.4% | **-2.4%** | **0.046** | **\*** | +0.22 (small) |
+| **User Effort** | 207.5 | 193.6 | **-13.9** | **0.021** | **\*** | +0.27 (small) |
+
+The Wilcoxon signed-rank test (non-parametric) gives consistent results:
+- User Effort: p=0.013 *
+- Timeout Rate: p=0.080 †
+
+### Conclusions
+
+After removing bug sessions:
+- **Timeout Rate moves from marginal (p=0.062) to significant (p=0.046)**
+- **User Effort remains significant (p=0.021)**
+- Success Rate rises slightly but is still not significant
+- E/T still goes in the opposite direction (rag_vector is higher)
+
+### Claims Usable in the Report
+
+> After removing sessions with agent repetition bugs (7.1% of rag_vector, 3.8% of reflection), rag_vector **significantly reduces timeout rate** (p<0.05) and **user effort** (p<0.05) compared to reflection, while maintaining comparable task success rate.
+
+The higher E/T can be explained as follows: the retrieval method surfaces more specific preferences, which leads the user to give more targeted feedback.
+
+---
+
 ## Next Steps

 1. **Wait for the rag_vector_60s and rag_rewrite_60s results**
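The detection criterion stated in the notes (at least 2 pairs of consecutive assistant replies with a SequenceMatcher ratio above 0.85) could be sketched roughly as below. This is only an illustration, not the script actually used for the cleanup: the message schema (dicts with `role` and `content` keys) and the function names `count_similar_pairs` and `has_repetition_bug` are hypothetical. It also assumes that similar pairs are counted across the whole session and need not be adjacent to each other, which the notes do not specify.

```python
from difflib import SequenceMatcher

# Thresholds taken from the notes; everything else here is an assumption.
SIMILARITY_THRESHOLD = 0.85
MIN_SIMILAR_PAIRS = 2


def count_similar_pairs(replies, threshold=SIMILARITY_THRESHOLD):
    """Count consecutive reply pairs whose similarity ratio exceeds the threshold."""
    pairs = 0
    for prev, curr in zip(replies, replies[1:]):
        if SequenceMatcher(None, prev, curr).ratio() > threshold:
            pairs += 1
    return pairs


def has_repetition_bug(messages, min_pairs=MIN_SIMILAR_PAIRS):
    """Flag a session whose assistant replies contain enough near-duplicate pairs.

    `messages` is assumed (hypothetically) to be a list of
    {"role": ..., "content": ...} dicts in conversation order.
    """
    replies = [m["content"] for m in messages if m.get("role") == "assistant"]
    return count_similar_pairs(replies) >= min_pairs
```

Under these assumptions, sessions for which `has_repetition_bug` returns `True` would be dropped before the metrics are recomputed.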
