path: root/docs/introduction_revised.md
# Revised LaTeX: Introduction

## Changes from original

1. Domain: "mathematically challenging domain" → 3 task domains (math-hard, math-500, bigcodebench)
2. Scale: "5 profiles from 40" → 60 profiles from 200, 60 sessions
3. "correlates strongly" → toned down to match actual ρ=0.040 (significant quartile test)
4. Contributions updated to reflect actual findings (efficiency gains, dual-vector)
5. "all 40 profiles" → "all 200 profiles"
6. Related work paragraphs kept as-is (still accurate)

---

## Revised LaTeX

```latex
\section{Introduction}

Large language models (LLMs) are increasingly deployed as conversational agents that assist users with complex, multi-step tasks such as information seeking, recommendation, and problem solving.
Recent surveys on LLM-based information retrieval and conversational systems highlight both the promise and the fragility of current LLM assistants: they can provide rich, multi-turn interactions, but their behavior is highly sensitive to prompts, context windows, and evaluation setups~\cite{zhu2025llm4ir,zamani2023cis,mo2025convsearch,jannach2021crs}.
In practice, many systems still treat each session as an isolated event, relying on generic system prompts and ad-hoc prompt engineering rather than explicit user modeling or long-term memory.

At the same time, there is growing interest in \emph{personalized} LLM agents that adapt to individual user preferences, interaction styles, and evolving needs over time.
From a modeling perspective, personalization can in principle cover a broad spectrum of user characteristics---stylistic preferences, topical interests, task-level constraints, safety requirements, and more.
However, most deployed systems today only expose shallow control knobs (e.g., temperature, tone presets) and lack an explicit user representation that can be shared, updated, or reused across sessions and applications.

In this work, we take a \emph{user-centric} view of long-term personalization.
Instead of hard-coding a specific domain or task, we propose a general framework in which each user is represented by a continuous vector in a preference embedding space.
This \emph{user vector} is learned from interaction, and is used to steer external memory retrieval and other personalization mechanisms.
Crucially, the framework is designed to be agnostic to the surface form of feedback: any supervision or implicit signal that can be mapped to a scalar reward (e.g., explicit ratings, good/bad labels, task success, dwell time) can, in principle, be integrated into the learning process.
The experiments in this report instantiate this framework on one concrete feedback channel (preference-enforcement signals from a simulated user), but the underlying design is deliberately more general.

A central challenge in developing such user models is evaluation.
Many recent personalization and long-term memory benchmarks provide rich multi-session interaction histories but only support \emph{offline} evaluation, where models are scored on static logs.
PersonaMem and its successor PersonaMem-v2 simulate detailed, evolving user personas across many scenarios and expose benchmarks for long-context personalized reasoning~\cite{jiang2025personamem,jiang2025personamemv2}.
LongMemEval focuses on long-term interactive memory, providing carefully constructed histories and questions that probe different memory capabilities of chat assistants~\cite{wu2024longmemeval}.
These datasets are valuable for analyzing model behavior, but by construction they do not support on-policy learning or evaluation of agents that \emph{change} their strategy based on interactive feedback.

Conversely, several recent agent evaluation frameworks introduce LLM-based user simulators and on-policy interaction, but focus on different capabilities.
For example, ToolSandbox targets tool-use agents and emphasizes stateful tool execution, implicit state dependencies, and complex multi-step tool calls~\cite{lu2025toolsandbox}.
It includes a built-in user simulator and interactive evaluation loop, but does not explicitly model long-term user preferences or multi-session personalization.
To the best of our knowledge, there has been no widely adopted benchmark that simultaneously offers (i)~multi-session user--agent interactions, (ii)~an LLM-based user simulator for online feedback, and (iii)~a focus on learning and exploiting user preferences over time.

We therefore build on MultiSessionCollab, a recently introduced benchmark designed specifically for long-term collaboration with personalized user profiles~\cite{mehri2026multisessioncollab}.
MultiSessionCollab provides synthetic user profiles with rich preference structures, a user simulator that enforces these preferences during multi-session collaborative tasks, and evaluation protocols for measuring how well an agent learns and leverages those preferences over time.
Importantly, the benchmark is inherently \emph{online}: the user simulator reacts to the agent's behavior, making it possible to use feedback signals to train and adapt personalization mechanisms.

In this technical report, we instantiate our general user modeling framework on MultiSessionCollab using open-source LLMs and external memory.
We evaluate across three task domains: mathematical problem solving (math-hard, math-500) and code generation (bigcodebench).
In each domain, user preferences describe interaction and solution \emph{styles} (e.g., level of detail, preferred explanation format, code formatting conventions).
User preferences are encoded as structured condition--action pairs, and we compare several memory and personalization baselines, including context-only and reflection-based agents, against our proposed combination of retrieval-augmented memory and learned user vectors.
All methods are evaluated on $60$ user profiles over $60$ sessions each under an on-policy interaction protocol with an LLM-based user simulator.

Our contributions are three-fold:
\begin{itemize}
  \item We propose a general user modeling framework that represents each user as a learned vector in a preference embedding space, designed to work with arbitrary feedback that can be mapped to scalar rewards.
        The framework decouples the extraction of preference evidence, the representation of user state, and the downstream personalization mechanisms.
        A dual-vector design separates stable cross-session preferences from transient within-session context.

  \item We instantiate this framework on the MultiSessionCollab benchmark~\cite{mehri2026multisessioncollab} using open-source LLMs and a lightweight external memory system.
        Our experiments compare six system modes (Vanilla, Contextual, All-memory, RAG, Reflection, RAG+Vector) over $60$ profiles with $60$ sessions each ($3{,}600$ sessions per method), providing a controlled comparison under frozen backbones.

  \item We present empirical evidence that the proposed RAG+Vector method significantly reduces interaction friction---lowering timeout rate ($p = 0.046$) and user effort ($p = 0.021$)---compared to a Reflection baseline, while maintaining comparable task success.
        Analysis of the learned user vectors confirms that the dual-vector design separates stable user identity from session-specific context, with long-term vectors significantly associating with cross-user preference overlap ($p = 0.006$).
\end{itemize}
```
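The structured condition--action preferences mentioned in the introduction can be sketched concretely. The schema below is an illustrative assumption, not the benchmark's actual format: each preference pairs a condition (when it applies) with an action (what the user expects), and the agent selects the subset applicable to the current session.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preference:
    """One condition--action pair (illustrative schema; field names
    are assumptions, not MultiSessionCollab's actual format)."""
    condition: str  # when the preference applies, e.g. a task type
    action: str     # what the user expects when it applies

def applicable(prefs, session_tags):
    """Return the preferences whose condition matches a session tag."""
    return [p for p in prefs if p.condition in session_tags]

prefs = [
    Preference("code generation", "use snake_case identifiers"),
    Preference("math problem", "show step-by-step derivations"),
]

# In a code-generation session, only the matching preference is active.
active = applicable(prefs, {"code generation"})
```

A set-membership match is of course a simplification; in practice the condition would be matched semantically (e.g., via the same embedding space as the user vector).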
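The dual-vector design and the scalar-reward feedback channel admit a minimal sketch. Everything below is an assumption for illustration, not the report's actual method: the update rule (reward-scaled evidence blended by exponential moving average), the two learning rates, and the additive query composition are all hypothetical choices. The point is only the separation of a slowly updated long-term vector from a session-local short-term vector.

```python
import numpy as np

class DualVectorUserModel:
    """Sketch of a dual-vector user model: a long-term vector for stable
    cross-session preferences and a short-term vector for transient
    within-session context. Rates and rules are illustrative assumptions."""

    def __init__(self, dim, long_rate=0.05, short_rate=0.5):
        self.long_term = np.zeros(dim)   # persists across sessions
        self.short_term = np.zeros(dim)  # reset at each session start
        self.long_rate = long_rate
        self.short_rate = short_rate

    def start_session(self):
        # Transient context is discarded; user identity is kept.
        self.short_term[:] = 0.0

    def update(self, evidence, reward):
        # Any feedback mapped to a scalar reward (ratings, good/bad labels,
        # task success, ...) scales the preference-evidence embedding.
        step = reward * evidence
        self.short_term += self.short_rate * step
        self.long_term = ((1 - self.long_rate) * self.long_term
                          + self.long_rate * step)

    def query_vector(self):
        # Combined vector used to steer external memory retrieval.
        return self.long_term + self.short_term

# Usage: positive feedback on some evidence embedding, then a new session.
model = DualVectorUserModel(dim=4)
model.update(evidence=np.ones(4), reward=1.0)
retrieval_query = model.query_vector()
model.start_session()  # short-term resets; long-term persists
```

The asymmetric rates are what make the long-term vector a candidate carrier of stable user identity while the short-term vector absorbs session-specific drift, mirroring the separation the analysis in the contributions reports.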