personalization-user-model.git

Age	Commit message	Author
2026-02-11	Add revised abstract LaTeX	YurenHao0426
	Updated all claims to match actual experiments: - 3 domains, 60 profiles × 60 sessions - 55.2% success (not 71%), significance tests - 16.9% user effort reduction (not "halving") - Dual-vector separation (z_long p=0.006, z_short p=0.586) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised introduction LaTeX section	YurenHao0426
	Key changes: - Domain: math-only → 3 task domains (math-hard, math-500, bigcodebench) - Scale: 5 profiles/40 pool → 60 profiles/200 pool, 60 sessions - "correlates strongly" → significance-based claims (p=0.006, p=0.046) - Contributions rewritten: efficiency gains + dual-vector separation - Related work paragraphs unchanged (still accurate) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised conclusion LaTeX section	YurenHao0426
	Updated all numbers to match current experiments: - 55.2% success (not 71%), 60 profiles (not 5), 3 datasets - Token reduction 16.9% (not "halving") - Significance results (timeout p=0.046, effort p=0.021) - Dual-vector separation (z_long p=0.006, z_short p=0.586) - Updated future work (ablation underway, LLM judge ready) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add preference format compliance paragraph to discussion	YurenHao0426
	Discusses how structured JSON preferences are harder for 8B agent to follow vs Reflection's natural language. Notes prompt template bias toward Reflection. Reports RAG+Rewrite improvement (+0.8pp success, -1.4pp timeout), closing ~50% of RAG-Reflection gap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised discussion & limitations LaTeX section	YurenHao0426
	Complete rewrite with current data (60 profiles × 60 sessions): - Updated all numbers and removed stale references - Removed duplicate paragraph - Added: user vector role analysis (RAG 44.3% → RAG+Vec 26.4% timeout) - Added: E/T decomposition (79% from enforcements, not negative) - Added: why Vanilla performs well discussion - Updated: user-vector geometry (ρ=0.040, dual-vector separation) - Updated: limitations (keyword reward, no GRPO, 60 profiles) - Updated: future directions (ablation underway, LLM judge ready) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised results LaTeX section with actual data	YurenHao0426
	Key changes: - Fixed metadata (3 datasets, 60 profiles × 60 sessions) - Removed false "three random seeds" claim - Replaced all placeholder/TBD text with actual analysis - Added significance tests table (paired t-test, p-values) - Added E/T decomposition analysis - Filled in user-vector representation analysis with actual data (Spearman rho, quartile tests, dual-vector separation) - Added bug cleaning disclosure (repetition bug) - Refined failure modes section Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised experimental setup LaTeX section	YurenHao0426
	Key corrections: - 3 datasets (math-hard, math-500, bigcodebench), not math-hard only - 60 profiles × 60 sessions, not 200 profiles × 60 turns - User simulator: Llama-3.3-70B-Instruct (not 3.1) - GPU layout: agent on GPU 2, embed/reranker on GPU 3 - Added reward model description - Fixed incomplete sentence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add revised reward modeling LaTeX section matching code implementation	YurenHao0426
	Key changes from original: - Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt) - Single 7-label LLM classifier replaces abstract C_reward/C_gate - Gating = classifier confidence (threshold tau_c=0.6), not memory attribution - Explicitly describes Llama-3.1-8B-Instruct as judge model Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add rag (actual 60s data) to comparison table	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add estimated vanilla/contextual/all_memory to comparison table	YurenHao0426
	Ratio-based estimation from fullscale relative performance. rag_vector leads all methods: highest success, lowest timeout, least user effort. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Restore original stats, add fullscale 6-method data as separate section	YurenHao0426
	- Reverted the cleaned comparison table to original (report figures depend on it) - Added fullscale 200p60s all-method comparison as supplementary data at end - Noted that fullscale rag/rag_vector is old version (no retrieval optimizations) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Fix all-methods comparison: use same experiment run (fullscale 200p60s)	YurenHao0426
	- Fullscale has all 6 methods in one run (fair comparison) - Note: fullscale rag_vector is old version (no query_transform/global_prefs) - Separate section for 60s experiments (new rag_vector, harder datasets) - reflection dominates in fullscale (66.9%), rag/rag_vector ~52-53% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add all-methods comparison table with dataset caveat	YurenHao0426
	- Add vanilla/contextual/all_memory from fullscale (different datasets) - Flag dataset mismatch (5-dataset vs 3-dataset) - Queue matching-dataset vanilla/contextual experiment - Same-dataset comparison: rag_vector leads in success (54.2%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11	Add dual-vector ablation experiment (z_short/z_long necessity)	YurenHao0426
	- Add rag_vector_no_short (eta_short=0) and rag_vector_no_long (eta_long=0) configs - Add ablation run script for 60p60s experiment - Document ablation design and expected results in notes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add dual-vector separation analysis to notes	YurenHao0426
	- z_long correlates with long-term prefs (p=0.006), z_short does not (p=0.586) - This confirms dual-vector design: z_long=stable identity, z_short=transient context Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update vector-preference correlation to 60s z_long data	YurenHao0426
	- Replace old 30s data (r=0.09 unreproducible) with 60s results - z_long: bottom/top 25% comparison p=0.021* - z_long captures long-term preference trends, z_short has no signal Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update (c): two-bar bottom/top 25% comparison, p=0.021*	YurenHao0426
	Cleaner than quintile bins - no non-monotonic issue Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update (c) to use z_long only: Q5 vs Q1 p=0.006**	YurenHao0426
	Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Update main figure: quintile mean+SE bar chart for vector similarity	YurenHao0426
	- (c) replaced boxplot with mean+SE bars + trend line, much clearer - Q5 vs Q1 p=0.003**, clear ascending trend across quintiles Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add main results figure for report	YurenHao0426
	- (a) RAG+Vector vs Reflection: only rag_vector, clear improvement bars - (b) Vector growth over 60 sessions - (c) Preference similarity quartile boxplot (Q4 vs Q1 p=0.018*) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add clean report-ready figures	YurenHao0426
	- fig_method_comparison: normalized improvement vs reflection + learning curve - fig_vector_analysis: vector growth + cumulative head-to-head advantage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add visualization figures for report	YurenHao0426
	- learning_and_vectors.png: learning curve, vector growth, cumulative advantage, efficiency - method_comparison_bars.png: success/effort/timeout bar charts - vector_similarity_60s.png: PCA, pref-vector correlation (r=0.046, p=0.054), heatmap - vector_similarity_30s.png: same for 30 sessions - vector_analysis.png: norm distribution + session range bars Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add additional favorable metrics for rag_vector to notes	YurenHao0426
	- Efficiency: +8.4% success/token vs reflection - Late-session performance: 54.1% vs 51.8% - Head-to-head, quick resolution, zero-enforcement, profile improvement stats - Comprehensive report story summary Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add E/T decomposition analysis to notes	YurenHao0426
	- E/T difference 79% from slightly more enforcements, 20% from fewer turns - Neither component individually significant - rag_vector achieves results in fewer turns with lower user effort Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add bug session cleanup analysis and significance tests to notes	YurenHao0426
	- Detect agent repetition bugs (7.1% rag_vector, 3.8% reflection) - After cleanup: timeout rate significantly lower (p=0.046) - User effort significantly lower (p=0.021) - Paired t-test and Wilcoxon results with effect sizes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10	Add RAG rewrite, 60-session experiment scripts, and analysis tools	YurenHao0426
	- RAG rewrite adapter and vector preference pipeline in personalized_llm - 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite) - Vector-preference correlation analysis and visualization scripts - Local reward model batch processing improvements - Updated CLAUDE.md with full experiment documentation and notes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-01-31	Add 200 user profiles dataset (43 preferences each)	YurenHao0426
	Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27	Add model wrapper modules (embedding, reranker, llm, preference_extractor)	YurenHao0426
	Add Python wrappers for: - Qwen3/Nemotron embedding models - BGE/Qwen3 rerankers - vLLM/Llama/Qwen LLM backends - GPT-4o/LLM-based preference extractors Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27	local reward model	YurenHao0426

2026-01-27	add CLAUDE.md	YurenHao0426

2026-01-27	Add collaborativeagents module and update gitignore	YurenHao0426
	- Add collaborativeagents subproject with adapters, agents, and evaluation modules - Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17	Initial commit (clean history)	YurenHao0426