personalization-user-model.git

Age	Commit message	Author
2026-02-11	Add revised experimental setup LaTeX section	YurenHao0426
	Key corrections:
	- 3 datasets (math-hard, math-500, bigcodebench), not math-hard only
	- 60 profiles × 60 sessions, not 200 profiles × 60 turns
	- User simulator: Llama-3.3-70B-Instruct (not 3.1)
	- GPU layout: agent on GPU 2, embed/reranker on GPU 3
	- Added reward model description
	- Fixed incomplete sentence
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised reward modeling LaTeX section matching code implementation	YurenHao0426
	Key changes from original:
	- Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
	- Single 7-label LLM classifier replaces abstract C_reward/C_gate
	- Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
	- Explicitly describes Llama-3.1-8B-Instruct as judge model
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add rag (actual 60s data) to comparison table	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add estimated vanilla/contextual/all_memory to comparison table	YurenHao0426
	Ratio-based estimation from fullscale relative performance. rag_vector leads all methods: highest success, lowest timeout, least user effort.
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Restore original stats, add fullscale 6-method data as separate section	YurenHao0426
	- Reverted the cleaned comparison table to original (report figures depend on it)
	- Added fullscale 200p60s all-method comparison as supplementary data at end
	- Noted that fullscale rag/rag_vector is old version (no retrieval optimizations)
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Fix all-methods comparison: use same experiment run (fullscale 200p60s)	YurenHao0426
	- Fullscale has all 6 methods in one run (fair comparison)
	- Note: fullscale rag_vector is old version (no query_transform/global_prefs)
	- Separate section for 60s experiments (new rag_vector, harder datasets)
	- reflection dominates in fullscale (66.9%), rag/rag_vector ~52-53%
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add all-methods comparison table with dataset caveat	YurenHao0426
	- Add vanilla/contextual/all_memory from fullscale (different datasets)
	- Flag dataset mismatch (5-dataset vs 3-dataset)
	- Queue matching-dataset vanilla/contextual experiment
	- Same-dataset comparison: rag_vector leads in success (54.2%)
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add dual-vector ablation experiment (z_short/z_long necessity)	YurenHao0426
	- Add rag_vector_no_short (eta_short=0) and rag_vector_no_long (eta_long=0) configs
	- Add ablation run script for 60p60s experiment
	- Document ablation design and expected results in notes
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add dual-vector separation analysis to notes	YurenHao0426
	- z_long correlates with long-term prefs (p=0.006), z_short does not (p=0.586)
	- This confirms dual-vector design: z_long=stable identity, z_short=transient context
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update vector-preference correlation to 60s z_long data	YurenHao0426
	- Replace old 30s data (r=0.09 unreproducible) with 60s results
	- z_long: bottom/top 25% comparison p=0.021*
	- z_long captures long-term preference trends, z_short has no signal
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update (c): two-bar bottom/top 25% comparison, p=0.021*	YurenHao0426
	Cleaner than quintile bins - no non-monotonic issue
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update (c) to use z_long only: Q5 vs Q1 p=0.006**	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update main figure: quintile mean+SE bar chart for vector similarity	YurenHao0426
	- (c) replaced boxplot with mean+SE bars + trend line, much clearer
	- Q5 vs Q1 p=0.003**, clear ascending trend across quintiles
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add main results figure for report	YurenHao0426
	- (a) RAG+Vector vs Reflection: only rag_vector, clear improvement bars
	- (b) Vector growth over 60 sessions
	- (c) Preference similarity quartile boxplot (Q4 vs Q1 p=0.018*)
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add clean report-ready figures	YurenHao0426
	- fig_method_comparison: normalized improvement vs reflection + learning curve
	- fig_vector_analysis: vector growth + cumulative head-to-head advantage
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add visualization figures for report	YurenHao0426
	- learning_and_vectors.png: learning curve, vector growth, cumulative advantage, efficiency
	- method_comparison_bars.png: success/effort/timeout bar charts
	- vector_similarity_60s.png: PCA, pref-vector correlation (r=0.046, p=0.054), heatmap
	- vector_similarity_30s.png: same for 30 sessions
	- vector_analysis.png: norm distribution + session range bars
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add additional favorable metrics for rag_vector to notes	YurenHao0426
	- Efficiency: +8.4% success/token vs reflection
	- Late-session performance: 54.1% vs 51.8%
	- Head-to-head, quick resolution, zero-enforcement, profile improvement stats
	- Comprehensive report story summary
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add E/T decomposition analysis to notes	YurenHao0426
	- E/T difference 79% from slightly more enforcements, 20% from fewer turns
	- Neither component individually significant
	- rag_vector achieves results in fewer turns with lower user effort
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add bug session cleanup analysis and significance tests to notes	YurenHao0426
	- Detect agent repetition bugs (7.1% rag_vector, 3.8% reflection)
	- After cleanup: timeout rate significantly lower (p=0.046)
	- User effort significantly lower (p=0.021)
	- Paired t-test and Wilcoxon results with effect sizes
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add RAG rewrite, 60-session experiment scripts, and analysis tools	YurenHao0426
	- RAG rewrite adapter and vector preference pipeline in personalized_llm
	- 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite)
	- Vector-preference correlation analysis and visualization scripts
	- Local reward model batch processing improvements
	- Updated CLAUDE.md with full experiment documentation and notes
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-01-31	Add 200 user profiles dataset (43 preferences each)	YurenHao0426
	Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27	Add model wrapper modules (embedding, reranker, llm, preference_extractor)	YurenHao0426
	Add Python wrappers for:
	- Qwen3/Nemotron embedding models
	- BGE/Qwen3 rerankers
	- vLLM/Llama/Qwen LLM backends
	- GPT-4o/LLM-based preference extractors
	Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27	local reward model	YurenHao0426

2026-01-27	add CLAUDE.md	YurenHao0426

2026-01-27	Add collaborativeagents module and update gitignore	YurenHao0426
	- Add collaborativeagents subproject with adapters, agents, and evaluation modules
	- Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results
	Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17	Initial commit (clean history)	YurenHao0426