path: root/notes.md
Age  Commit message  Author
2026-02-11  Add all-methods comparison table with dataset caveat  YurenHao0426
- Add vanilla/contextual/all_memory from fullscale (different datasets)
- Flag dataset mismatch (5-dataset vs 3-dataset)
- Queue matching-dataset vanilla/contextual experiment
- Same-dataset comparison: rag_vector leads in success (54.2%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add dual-vector ablation experiment (z_short/z_long necessity)  YurenHao0426
- Add rag_vector_no_short (eta_short=0) and rag_vector_no_long (eta_long=0) configs
- Add ablation run script for 60p60s experiment
- Document ablation design and expected results in notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
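The ablation described in this commit zeroes one update rate at a time. A minimal sketch of how such config variants could be derived from a base config is shown here. The `VectorConfig` class, the `make_ablation_configs` helper, and the base values of `eta_short` and `eta_long` are hypothetical placeholders for illustration; only the config names and the `eta_short=0` / `eta_long=0` settings come from the commit message.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VectorConfig:
    """Hypothetical container for the two update rates named in the commit."""
    name: str
    eta_short: float
    eta_long: float


def make_ablation_configs(base: VectorConfig) -> dict[str, VectorConfig]:
    """Return the base config plus one variant per zeroed update rate."""
    no_short = replace(base, name=f"{base.name}_no_short", eta_short=0.0)
    no_long = replace(base, name=f"{base.name}_no_long", eta_long=0.0)
    return {cfg.name: cfg for cfg in (base, no_short, no_long)}


# Placeholder base values, not taken from the repository.
base = VectorConfig(name="rag_vector", eta_short=0.1, eta_long=0.01)
configs = make_ablation_configs(base)

for cfg in configs.values():
    print(cfg)
```

Deriving each variant with `dataclasses.replace` keeps every other setting identical to the base, so a difference between runs can be attributed to the zeroed rate alone.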
2026-02-10  Add dual-vector separation analysis to notes  YurenHao0426
- z_long correlates with long-term prefs (p=0.006), z_short does not (p=0.586)
- This confirms dual-vector design: z_long=stable identity, z_short=transient context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Update vector-preference correlation to 60s z_long data  YurenHao0426
- Replace old 30s data (r=0.09 unreproducible) with 60s results
- z_long: bottom/top 25% comparison p=0.021*
- z_long captures long-term preference trends, z_short has no signal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add additional favorable metrics for rag_vector to notes  YurenHao0426
- Efficiency: +8.4% success/token vs reflection
- Late-session performance: 54.1% vs 51.8%
- Head-to-head, quick resolution, zero-enforcement, profile improvement stats
- Comprehensive report story summary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add E/T decomposition analysis to notes  YurenHao0426
- E/T difference 79% from slightly more enforcements, 20% from fewer turns
- Neither component individually significant
- rag_vector achieves results in fewer turns with lower user effort
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add bug session cleanup analysis and significance tests to notes  YurenHao0426
- Detect agent repetition bugs (7.1% rag_vector, 3.8% reflection)
- After cleanup: timeout rate significantly lower (p=0.046)
- User effort significantly lower (p=0.021)
- Paired t-test and Wilcoxon results with effect sizes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
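For readers unfamiliar with the paired comparison mentioned in this commit, the sketch below computes a paired t-statistic and a paired-samples effect size (Cohen's d_z) using only the Python standard library. The function name `paired_effect` and the sample values are hypothetical and are not results from the experiments; the repository's own analysis script may compute these differently, and a p-value would additionally require a t-distribution CDF (for example from SciPy).

```python
import math
import statistics


def paired_effect(a: list[float], b: list[float]) -> tuple[float, float, int]:
    """Return (t statistic, Cohen's d_z, degrees of freedom) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_d / (sd_d / math.sqrt(len(diffs)))
    d_z = mean_d / sd_d
    return t_stat, d_z, len(diffs) - 1


# Illustrative per-profile scores only; not data from the experiments.
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 2.0, 3.0]
t_stat, d_z, dof = paired_effect(method_a, method_b)
print(f"t = {t_stat:.3f}, d_z = {d_z:.3f}, df = {dof}")
```

Pairing by profile removes between-profile variance from the comparison, which is why the statistic is built from per-pair differences and not from the two group means.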
2026-02-10  Add RAG rewrite, 60-session experiment scripts, and analysis tools  YurenHao0426
- RAG rewrite adapter and vector preference pipeline in personalized_llm
- 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite)
- Vector-preference correlation analysis and visualization scripts
- Local reward model batch processing improvements
- Updated CLAUDE.md with full experiment documentation and notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>