author    YurenHao0426 <blackhao0426@gmail.com>  2026-02-11 03:07:56 +0000
committer YurenHao0426 <blackhao0426@gmail.com>  2026-02-11 03:07:56 +0000
commit    9adb6f9b4f1125f0e6824a8c0882a7f7d9e821c9 (patch)
tree      f1786d9cc7d01cf20d4d12c4c7903acb59e58a36 /docs/introduction_revised.md
parent    331db9ce19e49bbf418b1acedfe4231364c56641 (diff)
Add revised introduction LaTeX section
Key changes:
- Domain: math-only → 3 task domains (math-hard, math-500, bigcodebench)
- Scale: 5 profiles/40 pool → 60 profiles/200 pool, 60 sessions
- "correlates strongly" → significance-based claims (p=0.006, p=0.046)
- Contributions rewritten: efficiency gains + dual-vector separation
- Related work paragraphs unchanged (still accurate)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'docs/introduction_revised.md')
-rw-r--r--  docs/introduction_revised.md  65
1 file changed, 65 insertions(+), 0 deletions(-)
diff --git a/docs/introduction_revised.md b/docs/introduction_revised.md
new file mode 100644
index 0000000..ff0fa75
--- /dev/null
+++ b/docs/introduction_revised.md
@@ -0,0 +1,65 @@
# Revised LaTeX: Introduction

## Changes from original

1. Domain: "mathematically challenging domain" → 3 task domains (math, code)
2. Scale: "5 profiles from 40" → 60 profiles from 200, 60 sessions
3. "correlates strongly" → toned down to match actual ρ=0.040 (significant quartile test)
4. Contributions updated to reflect actual findings (efficiency gains, dual-vector)
5. "all 40 profiles" → "all 200 profiles"
6. Related work paragraphs kept as-is (still accurate)

---

## Revised LaTeX

```latex
\section{Introduction}

Large language models (LLMs) are increasingly deployed as conversational agents that assist users with complex, multi-step tasks such as information seeking, recommendation, and problem solving.
Recent surveys on LLM-based information retrieval and conversational systems highlight both the promise and the fragility of current LLM assistants: they can provide rich, multi-turn interactions, but their behavior is highly sensitive to prompts, context windows, and evaluation setups~\cite{zhu2025llm4ir,zamani2023cis,mo2025convsearch,jannach2021crs}.
In practice, many systems still treat each session as an isolated event, relying on generic system prompts and ad-hoc prompt engineering rather than explicit user modeling or long-term memory.

At the same time, there is growing interest in \emph{personalized} LLM agents that adapt to individual user preferences, interaction styles, and evolving needs over time.
From a modeling perspective, personalization can in principle cover a broad spectrum of user characteristics---stylistic preferences, topical interests, task-level constraints, safety requirements, and more.
However, most deployed systems today only expose shallow control knobs (e.g., temperature, tone presets) and lack an explicit user representation that can be shared, updated, or reused across sessions and applications.

In this work, we take a \emph{user-centric} view of long-term personalization.
Instead of hard-coding a specific domain or task, we propose a general framework in which each user is represented by a continuous vector in a preference embedding space.
This \emph{user vector} is learned from interaction and used to steer external memory retrieval and other personalization mechanisms.
Crucially, the framework is designed to be agnostic to the surface form of feedback: any supervision or implicit signal that can be mapped to a scalar reward (e.g., explicit ratings, good/bad labels, task success, dwell time) can, in principle, be integrated into the learning process.
The experiments in this report instantiate this framework on one concrete feedback channel (simulated user enforcement signals), but the underlying design is deliberately more general.

A central challenge in developing such user models is evaluation.
Many recent personalization and long-term memory benchmarks provide rich multi-session interaction histories but only support \emph{offline} evaluation, where models are scored on static logs.
PersonaMem and its successor PersonaMem-v2 simulate detailed, evolving user personas across many scenarios and expose benchmarks for long-context personalized reasoning~\cite{jiang2025personamem,jiang2025personamemv2}.
LongMemEval focuses on long-term interactive memory, providing carefully constructed histories and questions that probe different memory capabilities of chat assistants~\cite{wu2024longmemeval}.
These datasets are valuable for analyzing model behavior, but by construction they do not support on-policy learning or evaluation of agents that \emph{change} their strategy based on interactive feedback.

Conversely, several recent agent evaluation frameworks introduce LLM-based user simulators and on-policy interaction, but focus on different capabilities.
For example, ToolSandbox targets tool-use agents and emphasizes stateful tool execution, implicit state dependencies, and complex multi-step tool calls~\cite{lu2025toolsandbox}.
It includes a built-in user simulator and interactive evaluation loop, but does not explicitly model long-term user preferences or multi-session personalization.
To the best of our knowledge, there has been no widely adopted benchmark that simultaneously offers (i)~multi-session user--agent interactions, (ii)~an LLM-based user simulator for online feedback, and (iii)~a focus on learning and exploiting user preferences over time.

We therefore build on MultiSessionCollab, a recently introduced benchmark designed specifically for long-term collaboration with personalized user profiles~\cite{mehri2026multisessioncollab}.
MultiSessionCollab provides synthetic user profiles with rich preference structures, a user simulator that enforces these preferences during multi-session collaborative tasks, and evaluation protocols for measuring how well an agent learns and leverages those preferences over time.
Importantly, the benchmark is inherently \emph{online}: the user simulator reacts to the agent's behavior, making it possible to use feedback signals to train and adapt personalization mechanisms.

In this technical report, we instantiate our general user modeling framework on MultiSessionCollab using open-source LLMs and external memory.
We evaluate across three task domains---two in mathematical problem solving (math-hard, math-500) and one in code generation (bigcodebench)---where user preferences describe interaction and solution \emph{styles} (e.g., level of detail, preferred explanation format, code formatting conventions).
User preferences are encoded as structured condition--action pairs, and we compare several memory and personalization baselines, including context-only and reflection-based agents, against our proposed combination of retrieval-augmented memory and learned user vectors.
All methods are evaluated on $60$ user profiles over $60$ sessions each under an on-policy interaction protocol with an LLM-based user simulator.

Our contributions are three-fold:
\begin{itemize}
  \item We propose a general user modeling framework that represents each user as a learned vector in a preference embedding space, designed to work with arbitrary feedback that can be mapped to scalar rewards.
  The framework decouples the extraction of preference evidence, the representation of user state, and the downstream personalization mechanisms.
  A dual-vector design separates stable cross-session preferences from transient within-session context.

  \item We instantiate this framework on the MultiSessionCollab benchmark~\cite{mehri2026multisessioncollab} using open-source LLMs and a lightweight external memory system.
  Our experiments compare six system modes (Vanilla, Contextual, All-memory, RAG, Reflection, RAG+Vector) over $60$ profiles and $60$ sessions ($3{,}600$ sessions per method), providing a controlled comparison under frozen backbones.

  \item We present empirical evidence that the proposed RAG+Vector method significantly reduces interaction friction---lowering timeout rate ($p = 0.046$) and user effort ($p = 0.021$)---compared to a Reflection baseline, while maintaining comparable task success.
  Analysis of the learned user vectors confirms that the dual-vector design separates stable user identity from session-specific context, with long-term vectors significantly associating with cross-user preference overlap ($p = 0.006$).
\end{itemize}
```
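The dual-vector design mentioned in the introduction (a stable cross-session vector plus a transient within-session vector, updated from any feedback mapped to a scalar reward) could be sketched as follows. This is a minimal illustration, not the report's actual implementation: the class name, the exponential-moving-average update rule, and the rates are all assumptions introduced here.

```python
import numpy as np


class DualVectorUserModel:
    """Hypothetical dual-vector user model (illustrative sketch only).

    long_term: slow-moving vector, persists across sessions (stable identity).
    session:   fast-moving vector, reset at each session start (transient context).
    """

    def __init__(self, dim: int, long_rate: float = 0.05, session_rate: float = 0.5):
        self.long_term = np.zeros(dim)
        self.session = np.zeros(dim)
        self.long_rate = long_rate      # small step: slow cross-session drift
        self.session_rate = session_rate  # large step: fast within-session adaptation

    def start_session(self) -> None:
        # Transient context is discarded; the stable identity vector persists.
        self.session = np.zeros_like(self.session)

    def update(self, evidence: np.ndarray, reward: float) -> None:
        # Any feedback channel mapped to a scalar reward scales the update;
        # a negative reward pushes both vectors away from the evidence embedding.
        step = reward * evidence
        self.long_term = (1 - self.long_rate) * self.long_term + self.long_rate * step
        self.session = (1 - self.session_rate) * self.session + self.session_rate * step
```

Under this sketch, the separation the paper describes falls out of the two learning rates: the session vector tracks the current interaction closely, while the long-term vector integrates reward-weighted evidence slowly across sessions.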
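The introduction also says the user vector is "used to steer external memory retrieval". One way such steering could work, sketched here purely as an assumption, is to rank stored memory entries by cosine similarity against a blend of a long-term and a session vector; the function name, the `mix` weight, and the `(text, embedding)` memory format are all hypothetical.

```python
import numpy as np


def retrieve(memories: list[tuple[str, np.ndarray]],
             long_term: np.ndarray,
             session: np.ndarray,
             k: int = 3,
             mix: float = 0.7) -> list[str]:
    """Return the k memory texts most similar to a blend of the two user vectors.

    mix close to 1.0 favors stable cross-session preferences;
    mix close to 0.0 favors the current session's transient context.
    """
    query = mix * long_term + (1 - mix) * session
    qn = np.linalg.norm(query) + 1e-8  # epsilon guards against zero vectors

    def score(vec: np.ndarray) -> float:
        # Cosine similarity between the blended query and a memory embedding.
        return float(query @ vec) / (qn * (np.linalg.norm(vec) + 1e-8))

    ranked = sorted(memories, key=lambda m: score(m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

In a retrieval-augmented setup like the RAG+Vector mode the report evaluates, the returned entries would be placed in the agent's context; only the ranking step above is vector-steered.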