path: root/docs/abstract_revised.md
# Revised LaTeX: Abstract

## Changes from original

1. "math-hard benchmark" → 3 task domains
2. "57.0% → 71.0%" → actual numbers (55.2% best, significance claims)
3. "halving tokens (1384→689)" → 6.7% reduction in user effort (p=0.021)
4. "correlates strongly" → significant quartile test (p=0.006)
5. "small-scale pilot" → 60 profiles × 60 sessions
6. Added dual-vector finding

---

## Revised LaTeX

```latex
\begin{abstract}
Large language models (LLMs) are increasingly deployed as
conversational agents, yet most systems still treat each session
as an isolated event and lack an explicit user representation
that can be updated over time.
At the same time, per-user fine-tuning or RLHF is often
impractical in real deployments.
We present a frozen-backbone user modeling framework that
represents each user as a low-dimensional dual vector
(long-term and short-term) in a global preference space,
updated online from weak rewards.
The framework decouples preference extraction, user-state
representation, and downstream personalization: preferences are
extracted as condition--action rules, stored in a
retrieval-augmented memory, and aggregated into a user vector
that modulates retrieval scores.
Feedback from interaction---explicit or implicit---is mapped to
a scalar reward that drives REINFORCE-style updates of the user
vector, while keeping all backbone models (chat LLM, embedding,
reranker) frozen.

We instantiate this framework on the
\textsc{MultiSessionCollab} benchmark across three task domains
(math-hard, math-500, bigcodebench) with $60$ user profiles and
$60$ sessions per profile.
Our RAG+Vector agent achieves the highest task success rate
($55.2\%$) among six system modes and significantly reduces
interaction friction compared to a Reflection baseline:
timeout rate drops by $2.4$ percentage points ($p = 0.046$)
and user effort by $6.7\%$ ($p = 0.021$), yielding the highest
interaction efficiency ($2.83$ successes per $1{,}000$ user
tokens).
Analysis of the learned user vectors confirms that the
dual-vector design separates stable user identity from
session-specific context: long-term vectors significantly
associate with cross-user preference overlap ($p = 0.006$),
while short-term vectors do not ($p = 0.586$).
We discuss limitations of this simulator-based study and
outline directions for scaling to more users, richer preference
types, and stronger feedback signals.
\end{abstract}
```
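The update rule named in the abstract (a scalar reward driving REINFORCE-style updates of a dual user vector, with all backbone models frozen) can be sketched as follows. This is a minimal NumPy illustration under assumed design choices, not the paper's implementation: the Gaussian-policy parameterization, the additive score modulation, and every name (`modulated_scores`, `SIGMA`, `LR_LONG`, `beta`) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16                       # hypothetical user-vector dimensionality
SIGMA = 0.1                    # exploration noise of the Gaussian "policy"
LR_LONG, LR_SHORT = 0.01, 0.1  # short-term vector adapts faster (assumption)

def modulated_scores(query_emb, rule_embs, user_vec, beta=0.5):
    """Retrieval score = query relevance + user-preference alignment."""
    return rule_embs @ query_emb + beta * (rule_embs @ user_vec)

# Retrieval-augmented memory of condition-action rules (random stand-ins
# for frozen-embedder outputs).
rule_embs = rng.standard_normal((8, DIM))
u_long, u_short = np.zeros(DIM), np.zeros(DIM)

# --- one interaction step ---
query = rng.standard_normal(DIM)

# Sample perturbed vectors from a Gaussian policy centred on the means.
s_long = u_long + SIGMA * rng.standard_normal(DIM)
s_short = u_short + SIGMA * rng.standard_normal(DIM)
best_rule = int(np.argmax(modulated_scores(query, rule_embs, s_long + s_short)))

# Explicit or implicit feedback mapped to a scalar reward.
reward = 1.0

# REINFORCE-style update: move each mean along
#   reward * grad log N(sample; mean, SIGMA^2 I)
#   = reward * (sample - mean) / SIGMA^2.
u_long = u_long + LR_LONG * reward * (s_long - u_long) / SIGMA**2
u_short = u_short + LR_SHORT * reward * (s_short - u_short) / SIGMA**2
```

The split learning rates are one plausible way to realize the long-term/short-term separation: the slowly-updated vector accumulates stable identity while the fast one tracks session context. The chat LLM, embedder, and reranker never receive gradients.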