| Age | Commit message | Author |
|
Complete rewrite with current data (60 profiles × 60 sessions):
- Updated all numbers and removed stale references
- Removed duplicate paragraph
- Added: user vector role analysis (RAG 44.3% → RAG+Vec 26.4% timeout)
- Added: E/T decomposition (79% from enforcements, not negative)
- Added: why Vanilla performs well discussion
- Updated: user-vector geometry (ρ=0.040, dual-vector separation)
- Updated: limitations (keyword reward, no GRPO, 60 profiles)
- Updated: future directions (ablation underway, LLM judge ready)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Key changes:
- Fixed metadata (3 datasets, 60 profiles × 60 sessions)
- Removed false "three random seeds" claim
- Replaced all placeholder/TBD text with actual analysis
- Added significance tests table (paired t-test, p-values)
- Added E/T decomposition analysis
- Filled in user-vector representation analysis with actual data (Spearman rho, quartile tests, dual-vector separation)
- Added bug cleaning disclosure (repetition bug)
- Refined failure modes section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Key corrections:
- 3 datasets (math-hard, math-500, bigcodebench), not math-hard only
- 60 profiles × 60 sessions, not 200 profiles × 60 turns
- User simulator: Llama-3.3-70B-Instruct (not 3.1)
- GPU layout: agent on GPU 2, embed/reranker on GPU 3
- Added reward model description
- Fixed incomplete sentence
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Key changes from original:
- Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
- Single 7-label LLM classifier replaces abstract C_reward/C_gate
- Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
- Explicitly describes Llama-3.1-8B-Instruct as judge model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
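
The confidence gating described in this commit could look roughly like the sketch below. Only the threshold value (tau_c=0.6) comes from the commit message; the function name `gate_label`, the dictionary input, and the label names in the usage note are hypothetical and not taken from the repository.

```python
from typing import Dict, Optional, Tuple

TAU_C = 0.6  # confidence threshold named in the commit message


def gate_label(
    probs: Dict[str, float], tau_c: float = TAU_C
) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) if the top label clears tau_c, else None.

    `probs` maps each classifier label to its confidence. The label set
    is an assumption here; the commit only says there are 7 labels.
    """
    if not probs:
        return None
    # Pick the label the classifier is most confident about.
    label, conf = max(probs.items(), key=lambda kv: kv[1])
    # Gate: pass the label through only when confidence reaches tau_c.
    return (label, conf) if conf >= tau_c else None
```

For example, `gate_label({"positive": 0.7, "neutral": 0.3})` passes the top label through, while an evenly split `{"positive": 0.5, "neutral": 0.5}` is gated out and returns `None`.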
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Ratio-based estimation from fullscale relative performance.
rag_vector leads all methods: highest success, lowest timeout, least user effort.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Reverted the cleaned comparison table to original (report figures depend on it)
- Added fullscale 200p60s all-method comparison as supplementary data at end
- Noted that fullscale rag/rag_vector is old version (no retrieval optimizations)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Fullscale has all 6 methods in one run (fair comparison)
- Note: fullscale rag_vector is old version (no query_transform/global_prefs)
- Separate section for 60s experiments (new rag_vector, harder datasets)
- reflection dominates in fullscale (66.9%), rag/rag_vector ~52-53%
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Add vanilla/contextual/all_memory from fullscale (different datasets)
- Flag dataset mismatch (5-dataset vs 3-dataset)
- Queue matching-dataset vanilla/contextual experiment
- Same-dataset comparison: rag_vector leads in success (54.2%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Add rag_vector_no_short (eta_short=0) and rag_vector_no_long (eta_long=0) configs
- Add ablation run script for 60p60s experiment
- Document ablation design and expected results in notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- z_long correlates with long-term prefs (p=0.006), z_short does not (p=0.586)
- This confirms dual-vector design: z_long=stable identity, z_short=transient context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Replace old 30s data (r=0.09 unreproducible) with 60s results
- z_long: bottom/top 25% comparison p=0.021*
- z_long captures long-term preference trends, z_short has no signal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
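
A bottom/top 25% comparison of this kind might be computed along the lines below. This is a sketch under assumptions: it supposes that user pairs are ranked by preference similarity and that the compared quantity is the similarity of their vectors. The function and argument names are hypothetical, and the significance test itself is not shown.

```python
from statistics import mean
from typing import Sequence


def top_bottom_quartile_gap(
    pref_sim: Sequence[float], vec_sim: Sequence[float]
) -> float:
    """Mean vec_sim in the top 25% by pref_sim minus that in the bottom 25%."""
    if len(pref_sim) != len(vec_sim):
        raise ValueError("inputs must have the same length")
    k = len(pref_sim) // 4  # size of each quartile group
    if k == 0:
        raise ValueError("need at least 4 observations")
    # Rank observations by preference similarity (assumed binning variable).
    pairs = sorted(zip(pref_sim, vec_sim), key=lambda p: p[0])
    bottom = [v for _, v in pairs[:k]]
    top = [v for _, v in pairs[-k:]]
    return mean(top) - mean(bottom)
```

A positive gap would be consistent with vectors tracking preference similarity; whether it is significant needs a separate test on the two groups.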
|
|
Cleaner than quintile bins - no non-monotonic issue
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- (c) replaced boxplot with mean+SE bars + trend line, much clearer
- Q5 vs Q1 p=0.003**, clear ascending trend across quintiles
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- (a) RAG+Vector vs Reflection: only rag_vector, clear improvement bars
- (b) Vector growth over 60 sessions
- (c) Preference similarity quartile boxplot (Q4 vs Q1 p=0.018*)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- fig_method_comparison: normalized improvement vs reflection + learning curve
- fig_vector_analysis: vector growth + cumulative head-to-head advantage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- learning_and_vectors.png: learning curve, vector growth, cumulative advantage, efficiency
- method_comparison_bars.png: success/effort/timeout bar charts
- vector_similarity_60s.png: PCA, pref-vector correlation (r=0.046, p=0.054), heatmap
- vector_similarity_30s.png: same for 30 sessions
- vector_analysis.png: norm distribution + session range bars
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Efficiency: +8.4% success/token vs reflection
- Late-session performance: 54.1% vs 51.8%
- Head-to-head, quick resolution, zero-enforcement, profile improvement stats
- Comprehensive report story summary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- E/T difference 79% from slightly more enforcements, 20% from fewer turns
- Neither component individually significant
- rag_vector achieves results in fewer turns with lower user effort
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
- Detect agent repetition bugs (7.1% rag_vector, 3.8% reflection)
- After cleanup: timeout rate significantly lower (p=0.046)
- User effort significantly lower (p=0.021)
- Paired t-test and Wilcoxon results with effect sizes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
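
For reference, the paired t statistic and a paired effect size can be sketched with the standard library alone. This is an illustration, not the repository's analysis script: the function name `paired_t` and the choice of Cohen's d_z as the effect size are assumptions.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int, float]:
    """Return (t statistic, degrees of freedom, Cohen's d_z) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on per-pair differences (e.g. per-profile metric A minus metric B).
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    d_z = mean(diffs) / sd           # effect size for paired designs
    t = d_z * math.sqrt(len(diffs))  # t = mean / (sd / sqrt(n))
    return t, len(diffs) - 1, d_z
```

The p-value follows from the t distribution with the returned degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` provide the paired t-test and the Wilcoxon signed-rank test directly.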
|
|
- RAG rewrite adapter and vector preference pipeline in personalized_llm
- 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite)
- Vector-preference correlation analysis and visualization scripts
- Local reward model batch processing improvements
- Updated CLAUDE.md with full experiment documentation and notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
Add Python wrappers for:
- Qwen3/Nemotron embedding models
- BGE/Qwen3 rerankers
- vLLM/Llama/Qwen LLM backends
- GPT-4o/LLM-based preference extractors
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
|
|
|
|
- Add collaborativeagents subproject with adapters, agents, and evaluation modules
- Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
|