Diffstat (limited to 'docs')
-rw-r--r--  docs/conclusion_revised.md  68
1 file changed, 68 insertions(+), 0 deletions(-)
diff --git a/docs/conclusion_revised.md b/docs/conclusion_revised.md
new file mode 100644
index 0000000..0018169
--- /dev/null
+++ b/docs/conclusion_revised.md
@@ -0,0 +1,68 @@
+# Revised LaTeX: Conclusion
+
+## Changes from original
+
+1. All numbers updated (54.3%→55.2%, not 57%→71%; 60 profiles not 5; 3 datasets not 1)
+2. Token claim corrected: 16.9% reduction, not "halving"
+3. User-vector correlation updated: modest ρ but significant quartile test
+4. Scale description updated: 60 profiles from 200, 60 sessions
+5. Future work updated: ablation underway, LLM judge ready, 200 profiles target
+
+---
+
+## Revised LaTeX
+
+```latex
+\section{Conclusion}
+\label{sec:conclusion}
+
+We have presented a frozen-backbone personalization framework
+that combines structured preference extraction, a
+retrieval-augmented preference memory, and online user vectors
+updated via lightweight reinforcement learning.
+Rather than fine-tuning the base LLM for each user, the
+framework keeps all backbone models (chat, embedding, reranker)
+fixed and introduces personalization only through an external
+layer that represents user preferences as condition--action
+rules and aggregates them into a low-dimensional user vector.
+A key design choice is to treat feedback as a scalar reward,
+making the user-vector update agnostic to the specific feedback
+channel and, in principle, compatible with both explicit and
+implicit signals.
+
+We instantiated this framework on the
+\textsc{MultiSessionCollab} benchmark across three task domains
+(math-hard, math-500, bigcodebench) with an LLM-based user
+simulator and style-oriented user profiles.
+Under a frozen 8B backbone evaluated over $60$ profiles with
+$60$ sessions each ($3{,}600$ sessions per method), our
+\textbf{RAG+Vector} method achieves the highest task success
+rate ($55.2\%$) among all six system modes, significantly
+reduces timeout rate ($26.4\%$ vs.\ $28.8\%$, $p = 0.046$)
+and user effort ($193.6$ vs.\ $207.5$ tokens, $p = 0.021$)
+compared to the Reflection baseline, and attains the highest
+interaction efficiency ($2.83$ successes per $1{,}000$ user
+tokens).
+Analysis of the learned user vectors confirms that the
+dual-vector design separates stable user identity from
+session-specific context: long-term vectors $z^{(L)}$
+significantly associate with preference overlap across users
+($p = 0.006$), while short-term vectors $z^{(S)}$ do not
+($p = 0.586$), consistent with the intended division of labor.
+
+Our study evaluates $60$ profiles drawn from a pool of
+$200$, focuses on style preferences, and compares against a
+prompted Reflection baseline rather than the GRPO-trained
+agents from the original \textsc{MultiSessionCollab} paper.
+Several extensions are underway or planned: component
+ablations that isolate the contribution of each user-vector
+component, deployment of an LLM-based reward model to
+replace the current keyword heuristic, and scaling
+experiments to all $200$ profiles.
+We view this work as a case study demonstrating that
+lightweight user modeling on top of RAG---without any
+per-user fine-tuning of the backbone---can yield measurable
+efficiency gains in online, multi-session personalization,
+and as a starting point for larger-scale investigations of
+long-term adaptation with frozen LLM backbones.
+```
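The channel-agnostic scalar-reward update described in the conclusion can be sketched as a reward-weighted moving average over the two user vectors. This is a hypothetical illustration, not the paper's actual rule: the function name `update_user_vectors`, the EMA form, and the learning-rate values are assumptions introduced here; only the dual vectors $z^{(L)}$/$z^{(S)}$ and the scalar, feedback-channel-agnostic reward come from the text.

```python
import numpy as np

def update_user_vectors(z_long, z_short, session_embedding, reward,
                        lr_long=0.05, lr_short=0.5):
    """Hypothetical scalar-reward update for the dual user vectors.

    z_long            -- slow-moving long-term identity vector z^(L)
    z_short           -- fast-moving session-context vector z^(S)
    session_embedding -- embedding of the current session's signal
    reward            -- scalar feedback (e.g. in [-1, 1]); the update
                         never inspects which channel produced it
    """
    # Reward-weighted exponential moving averages: positive feedback
    # pulls each vector toward the current session embedding, negative
    # feedback pushes it away. The long-term vector uses a much smaller
    # learning rate, so it drifts slowly across sessions while the
    # short-term vector tracks session-specific context.
    z_long = z_long + lr_long * reward * (session_embedding - z_long)
    z_short = z_short + lr_short * reward * (session_embedding - z_short)
    return z_long, z_short
```

The asymmetric learning rates are one plausible way to realize the division of labor the analysis reports (stable identity in $z^{(L)}$, session-specific context in $z^{(S)}$); the actual update rule used in the paper is not reproduced here.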