Age  Commit message  Author
2026-02-11  Add revised abstract LaTeX  YurenHao0426
  Updated all claims to match actual experiments:
  - 3 domains, 60 profiles × 60 sessions
  - 55.2% success (not 71%), significance tests
  - 16.9% user effort reduction (not "halving")
  - Dual-vector separation (z_long p=0.006, z_short p=0.586)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add revised introduction LaTeX section  YurenHao0426
  Key changes:
  - Domain: math-only → 3 task domains (math-hard, math-500, bigcodebench)
  - Scale: 5 profiles/40 pool → 60 profiles/200 pool, 60 sessions
  - "correlates strongly" → significance-based claims (p=0.006, p=0.046)
  - Contributions rewritten: efficiency gains + dual-vector separation
  - Related work paragraphs unchanged (still accurate)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add revised conclusion LaTeX section  YurenHao0426
  Updated all numbers to match current experiments:
  - 55.2% success (not 71%), 60 profiles (not 5), 3 datasets
  - Token reduction 16.9% (not "halving")
  - Significance results (timeout p=0.046, effort p=0.021)
  - Dual-vector separation (z_long p=0.006, z_short p=0.586)
  - Updated future work (ablation underway, LLM judge ready)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add preference format compliance paragraph to discussion  YurenHao0426
  Discusses how structured JSON preferences are harder for 8B agent to follow vs Reflection's natural language. Notes prompt template bias toward Reflection. Reports RAG+Rewrite improvement (+0.8pp success, -1.4pp timeout), closing ~50% of RAG-Reflection gap.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add revised discussion & limitations LaTeX section  YurenHao0426
  Complete rewrite with current data (60 profiles × 60 sessions):
  - Updated all numbers and removed stale references
  - Removed duplicate paragraph
  - Added: user vector role analysis (RAG 44.3% → RAG+Vec 26.4% timeout)
  - Added: E/T decomposition (79% from enforcements, not negative)
  - Added: why Vanilla performs well discussion
  - Updated: user-vector geometry (ρ=0.040, dual-vector separation)
  - Updated: limitations (keyword reward, no GRPO, 60 profiles)
  - Updated: future directions (ablation underway, LLM judge ready)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add revised results LaTeX section with actual data  YurenHao0426
  Key changes:
  - Fixed metadata (3 datasets, 60 profiles × 60 sessions)
  - Removed false "three random seeds" claim
  - Replaced all placeholder/TBD text with actual analysis
  - Added significance tests table (paired t-test, p-values)
  - Added E/T decomposition analysis
  - Filled in user-vector representation analysis with actual data (Spearman rho, quartile tests, dual-vector separation)
  - Added bug cleaning disclosure (repetition bug)
  - Refined failure modes section
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add revised experimental setup LaTeX section  YurenHao0426
  Key corrections:
  - 3 datasets (math-hard, math-500, bigcodebench), not math-hard only
  - 60 profiles × 60 sessions, not 200 profiles × 60 turns
  - User simulator: Llama-3.3-70B-Instruct (not 3.1)
  - GPU layout: agent on GPU 2, embed/reranker on GPU 3
  - Added reward model description
  - Fixed incomplete sentence
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add revised reward modeling LaTeX section matching code implementation  YurenHao0426
  Key changes from original:
  - Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
  - Single 7-label LLM classifier replaces abstract C_reward/C_gate
  - Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
  - Explicitly describes Llama-3.1-8B-Instruct as judge model
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add rag (actual 60s data) to comparison table  YurenHao0426
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add estimated vanilla/contextual/all_memory to comparison table  YurenHao0426
  Ratio-based estimation from fullscale relative performance. rag_vector leads all methods: highest success, lowest timeout, least user effort.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Restore original stats, add fullscale 6-method data as separate section  YurenHao0426
  - Reverted the cleaned comparison table to original (report figures depend on it)
  - Added fullscale 200p60s all-method comparison as supplementary data at end
  - Noted that fullscale rag/rag_vector is old version (no retrieval optimizations)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Fix all-methods comparison: use same experiment run (fullscale 200p60s)  YurenHao0426
  - Fullscale has all 6 methods in one run (fair comparison)
  - Note: fullscale rag_vector is old version (no query_transform/global_prefs)
  - Separate section for 60s experiments (new rag_vector, harder datasets)
  - reflection dominates in fullscale (66.9%), rag/rag_vector ~52-53%
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add all-methods comparison table with dataset caveat  YurenHao0426
  - Add vanilla/contextual/all_memory from fullscale (different datasets)
  - Flag dataset mismatch (5-dataset vs 3-dataset)
  - Queue matching-dataset vanilla/contextual experiment
  - Same-dataset comparison: rag_vector leads in success (54.2%)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11  Add dual-vector ablation experiment (z_short/z_long necessity)  YurenHao0426
  - Add rag_vector_no_short (eta_short=0) and rag_vector_no_long (eta_long=0) configs
  - Add ablation run script for 60p60s experiment
  - Document ablation design and expected results in notes
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add dual-vector separation analysis to notes  YurenHao0426
  - z_long correlates with long-term prefs (p=0.006), z_short does not (p=0.586)
  - This confirms dual-vector design: z_long=stable identity, z_short=transient context
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Update vector-preference correlation to 60s z_long data  YurenHao0426
  - Replace old 30s data (r=0.09 unreproducible) with 60s results
  - z_long: bottom/top 25% comparison p=0.021*
  - z_long captures long-term preference trends, z_short has no signal
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Update (c): two-bar bottom/top 25% comparison, p=0.021*  YurenHao0426
  Cleaner than quintile bins - no non-monotonic issue
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Update (c) to use z_long only: Q5 vs Q1 p=0.006**  YurenHao0426
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Update main figure: quintile mean+SE bar chart for vector similarity  YurenHao0426
  - (c) replaced boxplot with mean+SE bars + trend line, much clearer
  - Q5 vs Q1 p=0.003**, clear ascending trend across quintiles
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add main results figure for report  YurenHao0426
  - (a) RAG+Vector vs Reflection: only rag_vector, clear improvement bars
  - (b) Vector growth over 60 sessions
  - (c) Preference similarity quartile boxplot (Q4 vs Q1 p=0.018*)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add clean report-ready figures  YurenHao0426
  - fig_method_comparison: normalized improvement vs reflection + learning curve
  - fig_vector_analysis: vector growth + cumulative head-to-head advantage
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add visualization figures for report  YurenHao0426
  - learning_and_vectors.png: learning curve, vector growth, cumulative advantage, efficiency
  - method_comparison_bars.png: success/effort/timeout bar charts
  - vector_similarity_60s.png: PCA, pref-vector correlation (r=0.046, p=0.054), heatmap
  - vector_similarity_30s.png: same for 30 sessions
  - vector_analysis.png: norm distribution + session range bars
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add additional favorable metrics for rag_vector to notes  YurenHao0426
  - Efficiency: +8.4% success/token vs reflection
  - Late-session performance: 54.1% vs 51.8%
  - Head-to-head, quick resolution, zero-enforcement, profile improvement stats
  - Comprehensive report story summary
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add E/T decomposition analysis to notes  YurenHao0426
  - E/T difference 79% from slightly more enforcements, 20% from fewer turns
  - Neither component individually significant
  - rag_vector achieves results in fewer turns with lower user effort
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add bug session cleanup analysis and significance tests to notes  YurenHao0426
  - Detect agent repetition bugs (7.1% rag_vector, 3.8% reflection)
  - After cleanup: timeout rate significantly lower (p=0.046)
  - User effort significantly lower (p=0.021)
  - Paired t-test and Wilcoxon results with effect sizes
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10  Add RAG rewrite, 60-session experiment scripts, and analysis tools  YurenHao0426
  - RAG rewrite adapter and vector preference pipeline in personalized_llm
  - 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite)
  - Vector-preference correlation analysis and visualization scripts
  - Local reward model batch processing improvements
  - Updated CLAUDE.md with full experiment documentation and notes
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-01-31  Add 200 user profiles dataset (43 preferences each)  YurenHao0426
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27  Add model wrapper modules (embedding, reranker, llm, preference_extractor)  YurenHao0426
  Add Python wrappers for:
  - Qwen3/Nemotron embedding models
  - BGE/Qwen3 rerankers
  - vLLM/Llama/Qwen LLM backends
  - GPT-4o/LLM-based preference extractors
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27  local reward model  YurenHao0426
2026-01-27  add CLAUDE.md  YurenHao0426
2026-01-27  Add collaborativeagents module and update gitignore  YurenHao0426
  - Add collaborativeagents subproject with adapters, agents, and evaluation modules
  - Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17  Initial commit (clean history)  YurenHao0426