-rw-r--r--docs/introduction_revised.md65
1 file changed, 65 insertions, 0 deletions
diff --git a/docs/introduction_revised.md b/docs/introduction_revised.md
new file mode 100644
index 0000000..ff0fa75
--- /dev/null
+++ b/docs/introduction_revised.md
@@ -0,0 +1,65 @@
+# Revised LaTeX: Introduction
+
+## Changes from original
+
+1. Domain: "mathematically challenging domain" → three tasks (math-hard, math-500, bigcodebench)
+2. Scale: "5 profiles from 40" → 60 profiles from 200, 60 sessions
+3. "correlates strongly" → toned down to match the actual correlation (ρ = 0.040; quartile test remains significant)
+4. Contributions updated to reflect actual findings (efficiency gains, dual-vector)
+5. "all 40 profiles" → "all 200 profiles"
+6. Related work paragraphs kept as-is (still accurate)
+
+---
+
+## Revised LaTeX
+
+```latex
+\section{Introduction}
+
+Large language models (LLMs) are increasingly deployed as conversational agents that assist users with complex, multi-step tasks such as information seeking, recommendation, and problem solving.
+Recent surveys on LLM-based information retrieval and conversational systems highlight both the promise and the fragility of current LLM assistants: they can provide rich, multi-turn interactions, but their behavior is highly sensitive to prompts, context windows, and evaluation setups~\cite{zhu2025llm4ir,zamani2023cis,mo2025convsearch,jannach2021crs}.
+In practice, many systems still treat each session as an isolated event, relying on generic system prompts and ad-hoc prompt engineering rather than explicit user modeling or long-term memory.
+
+At the same time, there is growing interest in \emph{personalized} LLM agents that adapt to individual user preferences, interaction styles, and evolving needs over time.
+From a modeling perspective, personalization can in principle cover a broad spectrum of user characteristics---stylistic preferences, topical interests, task-level constraints, safety requirements, and more.
+However, most deployed systems today only expose shallow control knobs (e.g., temperature, tone presets) and lack an explicit user representation that can be shared, updated, or reused across sessions and applications.
+
+In this work, we take a \emph{user-centric} view of long-term personalization.
+Instead of hard-coding a specific domain or task, we propose a general framework in which each user is represented by a continuous vector in a preference embedding space.
+This \emph{user vector} is learned from interaction and used to steer external memory retrieval and other personalization mechanisms.
+Crucially, the framework is designed to be agnostic to the surface form of feedback: any supervision or implicit signal that can be mapped to a scalar reward (e.g., explicit ratings, good/bad labels, task success, dwell time) can, in principle, be integrated into the learning process.
+The experiments in this report instantiate this framework on one concrete feedback channel (preference-enforcement signals from a simulated user), but the underlying design is deliberately more general.
+
+A central challenge in developing such user models is evaluation.
+Many recent personalization and long-term memory benchmarks provide rich multi-session interaction histories but only support \emph{offline} evaluation, where models are scored on static logs.
+PersonaMem and its successor PersonaMem-v2 simulate detailed, evolving user personas across many scenarios and expose benchmarks for long-context personalized reasoning~\cite{jiang2025personamem,jiang2025personamemv2}.
+LongMemEval focuses on long-term interactive memory, providing carefully constructed histories and questions that probe different memory capabilities of chat assistants~\cite{wu2024longmemeval}.
+These datasets are valuable for analyzing model behavior, but by construction they do not support on-policy learning or evaluation of agents that \emph{change} their strategy based on interactive feedback.
+
+Conversely, several recent agent evaluation frameworks introduce LLM-based user simulators and on-policy interaction, but focus on different capabilities.
+For example, ToolSandbox targets tool-use agents and emphasizes stateful tool execution, implicit state dependencies, and complex multi-step tool calls~\cite{lu2025toolsandbox}.
+It includes a built-in user simulator and interactive evaluation loop, but does not explicitly model long-term user preferences or multi-session personalization.
+To the best of our knowledge, there has been no widely adopted benchmark that simultaneously offers (i)~multi-session user--agent interactions, (ii)~an LLM-based user simulator for online feedback, and (iii)~a focus on learning and exploiting user preferences over time.
+
+We therefore build on MultiSessionCollab, a recently introduced benchmark designed specifically for long-term collaboration with personalized user profiles~\cite{mehri2026multisessioncollab}.
+MultiSessionCollab provides synthetic user profiles with rich preference structures, a user simulator that enforces these preferences during multi-session collaborative tasks, and evaluation protocols for measuring how well an agent learns and leverages those preferences over time.
+Importantly, the benchmark is inherently \emph{online}: the user simulator reacts to the agent's behavior, making it possible to use feedback signals to train and adapt personalization mechanisms.
+
+In this technical report, we instantiate our general user modeling framework on MultiSessionCollab using open-source LLMs and external memory.
+We evaluate on three tasks---two mathematical problem-solving benchmarks (math-hard, math-500) and one code generation benchmark (bigcodebench)---where user preferences describe interaction and solution \emph{styles} (e.g., level of detail, preferred explanation format, code formatting conventions).
+User preferences are encoded as structured condition--action pairs, and we compare several memory and personalization baselines, including context-only and reflection-based agents, against our proposed combination of retrieval-augmented memory and learned user vectors.
+All methods are evaluated on $60$ user profiles over $60$ sessions each under an on-policy interaction protocol with an LLM-based user simulator.
+
+Our contributions are three-fold:
+\begin{itemize}
+ \item We propose a general user modeling framework that represents each user as a learned vector in a preference embedding space, designed to work with arbitrary feedback that can be mapped to scalar rewards.
+ The framework decouples the extraction of preference evidence, the representation of user state, and the downstream personalization mechanisms.
+ A dual-vector design separates stable cross-session preferences from transient within-session context.
+
+ \item We instantiate this framework on the MultiSessionCollab benchmark~\cite{mehri2026multisessioncollab} using open-source LLMs and a lightweight external memory system.
+ Our experiments compare six system modes (Vanilla, Contextual, All-memory, RAG, Reflection, RAG+Vector) over $60$ profiles and $60$ sessions ($3{,}600$ sessions per method), providing a controlled comparison under frozen backbones.
+
+ \item We present empirical evidence that the proposed RAG+Vector method significantly reduces interaction friction---lowering timeout rate ($p = 0.046$) and user effort ($p = 0.021$)---compared to a Reflection baseline, while maintaining comparable task success.
+ Analysis of the learned user vectors confirms that the dual-vector design separates stable user identity from session-specific context, with long-term vectors significantly associating with cross-user preference overlap ($p = 0.006$).
+\end{itemize}
+```
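For concreteness, the dual-vector design and the scalar-reward interface described in the introduction can be sketched in a few lines of Python. The class name, embedding dimensionality, and the reward-weighted EMA update rule below are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np


class DualVectorUserModel:
    """Illustrative sketch: one long-term vector for stable cross-session
    preferences, one transient vector for within-session context."""

    def __init__(self, dim: int = 64, long_rate: float = 0.05, session_rate: float = 0.5):
        self.long_term = np.zeros(dim)   # stable user identity
        self.session = np.zeros(dim)     # transient session context
        self.long_rate = long_rate       # slow EMA rate (assumed value)
        self.session_rate = session_rate # fast EMA rate (assumed value)

    def update(self, evidence: np.ndarray, reward: float) -> None:
        """Move both vectors toward an embedded piece of preference
        evidence, weighted by any feedback mapped to a scalar reward
        (rating, good/bad label, task success, dwell time, ...)."""
        self.session += self.session_rate * reward * (evidence - self.session)
        self.long_term += self.long_rate * reward * (evidence - self.long_term)

    def end_session(self) -> None:
        # Session context is transient: reset it between sessions,
        # while the long-term vector persists.
        self.session[:] = 0.0

    def query_vector(self) -> np.ndarray:
        # Downstream personalization (e.g., steering memory retrieval)
        # can combine the two vectors; a plain sum is the simplest choice.
        return self.long_term + self.session
```

Because the two vectors share one update signal but decay at very different rates, the long-term vector accumulates only preferences that recur across sessions, which is the separation the learned-vector analysis in the contributions refers to.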