| Age | Commit message | Author |
|
- Reverted the cleaned comparison table to original (report figures depend on it)
- Added fullscale 200p60s all-method comparison as supplementary data at end
- Noted that fullscale rag/rag_vector is old version (no retrieval optimizations)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Fullscale has all 6 methods in one run (fair comparison)
- Note: fullscale rag_vector is old version (no query_transform/global_prefs)
- Separate section for 60s experiments (new rag_vector, harder datasets)
- reflection dominates in fullscale (66.9%), rag/rag_vector ~52-53%
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Add vanilla/contextual/all_memory from fullscale (different datasets)
- Flag dataset mismatch (5-dataset vs 3-dataset)
- Queue matching-dataset vanilla/contextual experiment
- Same-dataset comparison: rag_vector leads in success (54.2%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Add rag_vector_no_short (eta_short=0) and rag_vector_no_long (eta_long=0) configs
- Add ablation run script for 60p60s experiment
- Document ablation design and expected results in notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- z_long correlates with long-term prefs (p=0.006), z_short does not (p=0.586)
- This confirms dual-vector design: z_long=stable identity, z_short=transient context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Replace old 30s data (r=0.09 unreproducible) with 60s results
- z_long: bottom/top 25% comparison p=0.021*
- z_long captures long-term preference trends, z_short has no signal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Efficiency: +8.4% success/token vs reflection
- Late-session performance: 54.1% vs 51.8%
- Head-to-head, quick resolution, zero-enforcement, profile improvement stats
- Comprehensive report story summary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- E/T difference: 79% from slightly more enforcements, 20% from fewer turns
- Neither component individually significant
- rag_vector achieves results in fewer turns with lower user effort
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Detect agent repetition bugs (7.1% rag_vector, 3.8% reflection)
- After cleanup: timeout rate significantly lower (p=0.046)
- User effort significantly lower (p=0.021)
- Paired t-test and Wilcoxon results with effect sizes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- RAG rewrite adapter and vector preference pipeline in personalized_llm
- 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite)
- Vector-preference correlation analysis and visualization scripts
- Local reward model batch processing improvements
- Updated CLAUDE.md with full experiment documentation and notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|