| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 03:05:53 +0000 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 03:05:53 +0000 |
| commit | 331db9ce19e49bbf418b1acedfe4231364c56641 (patch) | |
| tree | 5087f44ea24604a08329e73d0314e1bdf9599062 | |
| parent | 713b394a94207c8fa8a8f712bb548ee661017caf (diff) | |
Add revised conclusion LaTeX section
Updated all numbers to match current experiments:
- 55.2% success (not 71%), 60 profiles (not 5), 3 datasets
- Token reduction 16.9% (not "halving")
- Significance results (timeout p=0.046, effort p=0.021)
- Dual-vector separation (z_long p=0.006, z_short p=0.586)
- Updated future work (ablation underway, LLM judge ready)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| -rw-r--r-- | docs/conclusion_revised.md | 68 |
1 file changed, 68 insertions, 0 deletions
diff --git a/docs/conclusion_revised.md b/docs/conclusion_revised.md
new file mode 100644
index 0000000..0018169
--- /dev/null
+++ b/docs/conclusion_revised.md
@@ -0,0 +1,68 @@

# Revised LaTeX: Conclusion

## Changes from original

1. All numbers updated (54.3% → 55.2%, not 57% → 71%; 60 profiles, not 5; 3 datasets, not 1)
2. Token claim corrected: 16.9% reduction, not "halving"
3. User-vector correlation updated: modest ρ but significant quartile test
4. Scale description updated: 60 profiles drawn from 200, 60 sessions each
5. Future work updated: ablation underway, LLM judge ready, 200 profiles target

---

## Revised LaTeX

```latex
\section{Conclusion}
\label{sec:conclusion}

We have presented a frozen-backbone personalization framework
that combines structured preference extraction, a
retrieval-augmented preference memory, and online user vectors
updated via lightweight reinforcement learning.
Rather than fine-tuning the base LLM for each user, the
framework keeps all backbone models (chat, embedding, reranker)
fixed and introduces personalization only through an external
layer that represents user preferences as condition--action
rules and aggregates them into a low-dimensional user vector.
A key design choice is to treat feedback as a scalar reward,
making the user-vector update agnostic to the specific feedback
channel and, in principle, compatible with both explicit and
implicit signals.

We instantiated this framework on the
\textsc{MultiSessionCollab} benchmark across three task domains
(math-hard, math-500, bigcodebench) with an LLM-based user
simulator and style-oriented user profiles.
Under a frozen 8B backbone evaluated over $60$ profiles with
$60$ sessions each ($3{,}600$ sessions per method), our
\textbf{RAG+Vector} method achieves the highest task success
rate ($55.2\%$) among all six system modes, significantly
reduces the timeout rate ($26.4\%$ vs.\ $28.8\%$, $p = 0.046$)
and user effort ($193.6$ vs.\ $207.5$ tokens, $p = 0.021$)
relative to the Reflection baseline, and attains the highest
interaction efficiency ($2.83$ successes per $1{,}000$ user
tokens).
Analysis of the learned user vectors confirms that the
dual-vector design separates stable user identity from
session-specific context: long-term vectors $z^{(L)}$
associate significantly with preference overlap across users
($p = 0.006$), while short-term vectors $z^{(S)}$ do not
($p = 0.586$), consistent with the intended division of labor.

Our study evaluates $60$ profiles drawn from a pool of
$200$, focuses on style preferences, and compares against a
prompted Reflection baseline rather than the GRPO-trained
agents of the original \textsc{MultiSessionCollab} paper.
Several extensions are underway or planned: component
ablations that isolate the contribution of each user-vector
component, deployment of an LLM-based reward model to
replace the current keyword heuristic, and scaling
experiments to all $200$ profiles.
We view this work as a case study demonstrating that
lightweight user modeling on top of RAG---without any
per-user fine-tuning of the backbone---can yield measurable
efficiency gains in online, multi-session personalization,
and as a starting point for larger-scale investigations of
long-term adaptation with frozen LLM backbones.
```
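The channel-agnostic, scalar-reward user-vector update described in the conclusion can be sketched as follows. This is a minimal illustration under stated assumptions: the vector dimension, learning rates, the centered-reward update rule, and the names `z_long`/`z_short` are all hypothetical, not taken from the repository.

```python
import numpy as np

DIM = 16                        # hypothetical user-vector dimension
rng = np.random.default_rng(0)

z_long = np.zeros(DIM)          # stable user identity, z^(L)
z_short = np.zeros(DIM)         # session-specific context, z^(S)

def update(z, direction, reward, baseline=0.5, lr=0.1):
    """Move z along `direction`, scaled by the centered scalar reward.

    The feedback channel is irrelevant at this point: explicit ratings
    and implicit signals alike are collapsed into `reward` before the
    update, which is what makes the rule channel-agnostic.
    """
    return z + lr * (reward - baseline) * direction

# One simulated session: a unit perturbation direction and its reward.
direction = rng.standard_normal(DIM)
direction /= np.linalg.norm(direction)

z_short = update(z_short, direction, reward=1.0)          # fast, per-session
z_long = update(z_long, direction, reward=1.0, lr=0.01)   # slow, cross-session
```

Giving `z_long` a much smaller learning rate than `z_short` is one simple way to realize the intended division of labor: the long-term vector only accumulates signals that recur across sessions, while the short-term vector tracks the current session.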
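The interaction-efficiency figure can be roughly sanity-checked from the other reported aggregates: success rate divided by mean user tokens per session, in thousands. This is a back-of-the-envelope check, not the paper's exact computation; a small gap from the reported 2.83 is expected if the paper aggregates per session rather than from the two means.

```python
success_rate = 0.552        # reported RAG+Vector task success rate
tokens_per_session = 193.6  # reported mean user effort in tokens

efficiency = success_rate / (tokens_per_session / 1000.0)
print(f"{efficiency:.2f} successes per 1,000 user tokens")  # ~2.85
```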
