# Revised LaTeX: Discussion & Limitation
## Changes from original
1. All numbers updated to match 60 profiles × 60 sessions × 3 datasets
2. Removed duplicate "Beyond the aggregate table" paragraph
3. Rewrote Contextual vs RAG analysis (old narrative no longer holds)
4. Updated user-vector analysis (ρ=0.040, Q1/Q5 test, dual-vector separation)
5. Removed old variance analysis (three runs on 5 profiles)
6. Added E/T decomposition discussion
7. Added RAG timeout analysis (44.3% → 26.4% with vector)
8. Updated Limitations (60 profiles, 60 sessions, keyword reward, style-only)
9. Updated future directions (ablation studies underway, LLM judge available)
---
## Revised LaTeX
```latex
\section{Discussion \& Limitations}
\label{sec:discussion}
\paragraph{Summary of findings.}
Our experiments show that user modeling matters for
multi-session personalization, but the benefits are
nuanced.
On the three-domain evaluation ($60$~profiles,
$60$~sessions per profile), RAG+Vector achieves the highest task
success rate ($55.2\%$), but the absolute gain over
Reflection ($54.4\%$) is small and not statistically
significant ($p = 0.276$).
The more consistent improvements appear on
interaction-efficiency metrics: RAG+Vector significantly
reduces timeout rate ($-2.4$~pp, $p = 0.046$) and user
effort ($-13.9$ tokens, $p = 0.021$) compared to
Reflection, while yielding the highest success-per-token
efficiency ($2.83$ per $1{,}000$ user tokens, $+8.4\%$
vs.\ Reflection).
These patterns suggest that a lightweight, frozen-backbone
user model primarily improves \emph{how efficiently} the
agent interacts rather than \emph{how often} it ultimately
succeeds.
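% [Added sketch: one reading of the efficiency metric consistent with the
% reported numbers; the exact normalization (per-session mean vs.\ pooled
% tokens) is an assumption.]
Here efficiency is success rate normalized by user effort,
\[
\mathrm{Eff} \;=\; \frac{\text{task success rate}}{\text{mean user tokens per session}} \times 1000,
\]
so a method can raise efficiency either by succeeding more often or by
demanding fewer user tokens per session.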
\paragraph{The role of the user vector.}
Comparing RAG ($52.0\%$) with RAG+Vector ($55.2\%$)
isolates the contribution of the learned user vector,
since the two methods share identical preference extraction,
embedding, and reranking components.
The $+3.2$~pp success gain is accompanied by a dramatic
reduction in timeout rate ($44.3\% \to 26.4\%$), suggesting
that the user vector helps the retrieval policy prioritize
the most relevant preferences and avoid flooding the agent
with tangential or conflicting memory cards.
Without the vector, the RAG method suffers from
\emph{preference overload}: as the memory grows over
sessions, the retrieval system surfaces an increasingly
noisy set of preferences, which leads the agent to either
hedge (by asking clarification questions) or produce
unfocused responses.
The user vector mitigates this by biasing retrieval toward
preferences that have historically led to positive user
feedback for this particular user.
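% [Added sketch: an illustrative additive form of the vector-biased
% retrieval score; the combination rule and the use of cosine similarity
% are assumptions, though $\beta_L$ and $\beta_S$ are the scoring weights
% listed under hyperparameter sensitivity.]
A minimal form of this bias scores each memory card $c$ as
\[
s(c) \;=\; \cos(e_q, e_c) \;+\; \beta_L \cos\!\big(z^{(L)}_u, e_c\big) \;+\; \beta_S \cos\!\big(z^{(S)}_u, e_c\big),
\]
where $e_q$ and $e_c$ are the query and card embeddings, so that cards
aligned with the user's learned identity are promoted over merely topical
matches.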
\paragraph{Comparison with Reflection.}
The Reflection baseline generates a reasoning-heavy
session-level summary after each conversation and appends
it to future prompts, providing an implicit form of
cross-session memory.
RAG+Vector matches Reflection in task success while using
$6.7\%$ fewer user tokens per session.
This efficiency advantage is practically meaningful in
deployment scenarios where user patience and token budgets
are limited.
Under a frozen backbone, a relatively cheap combination of
preference memory and user vectors can match the performance
of a reflection-style agent that incurs an extra LLM pass
per session.
In our implementation, the Reflection baseline uses prompted
reflections without GRPO fine-tuning; reproducing the
GRPO-trained reflection agents from the original
\textsc{MultiSessionCollab} paper remains future work and
would likely raise the Reflection ceiling further.
\paragraph{Why does Vanilla perform well?}
A perhaps surprising finding is that Vanilla ($54.3\%$)
nearly matches Reflection ($54.4\%$), despite having no
cross-session memory at all.
This reflects the difficulty of the evaluation setting:
with three diverse task domains and $43$ style preferences
per profile, even a memory-augmented agent cannot reliably
satisfy all preferences, and the user simulator's preference
enforcement behavior introduces substantial variance.
The result also suggests that a meaningful portion of task
success is driven by the agent's intrinsic problem-solving
ability rather than preference compliance, and that current
personalization methods yield incremental rather than
transformative gains on this benchmark.
\paragraph{Enforcement rate and interaction dynamics.}
RAG+Vector exhibits a slightly higher enforcement-per-turn
rate ($0.186$) than Reflection ($0.175$), which might
initially appear counterproductive.
However, decomposing this metric reveals that $79\%$ of the
gap comes from a marginal increase in total enforcements per
session ($1.54$ vs.\ $1.47$, $p = 0.058$), while RAG+Vector
simultaneously completes sessions in fewer turns ($8.31$ vs.\
$8.41$).
Since RAG+Vector achieves comparable success with fewer
turns and significantly lower user token cost, the slightly
elevated enforcement rate does not translate into worse
interaction quality.
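% [Added sketch: the $79\%$ figure follows from a standard rate
% decomposition of the document's own numbers; attributing the
% enforcement term first and the turn term second is one conventional
% ordering.]
Writing the per-turn rate as $r = E/T$, with $E$ enforcements and $T$
turns per session, the gap decomposes as
\[
r_{\mathrm{RV}} - r_{\mathrm{Refl}}
= \underbrace{\frac{E_{\mathrm{RV}} - E_{\mathrm{Refl}}}{T_{\mathrm{Refl}}}}_{\approx 0.0083\;(79\%)}
+ \underbrace{E_{\mathrm{RV}}\!\left(\frac{1}{T_{\mathrm{RV}}} - \frac{1}{T_{\mathrm{Refl}}}\right)}_{\approx 0.0022\;(21\%)}
\approx 0.0105 .
\]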
We hypothesize that RAG's explicit preference injection
prompts the user simulator to give more targeted,
preference-specific feedback, whereas Reflection's
summarized notes lead to broader but less pointed
enforcement signals.
\paragraph{User-vector geometry.}
Our analysis of the learned long-term vectors $z^{(L)}_u$
reveals that they encode preference-aware structure rather
than random noise.
Users with more similar revealed preferences develop more
similar long-term vectors: the top-$25\%$ of user pairs by
preference Jaccard overlap have a mean vector cosine of
$+0.012$, versus $-0.027$ for the bottom-$25\%$
($p = 0.021$, Mann--Whitney $U$).
The dual-vector design successfully separates stable
preference identity from session-specific context:
$z^{(L)}$ significantly associates with long-term
preference overlap ($p = 0.006$), while $z^{(S)}$ shows
no such association ($p = 0.586$).
This confirms the intended division of labor---$z^{(L)}$
captures enduring user identity, while $z^{(S)}$ absorbs
transient within-session signals and decays between sessions.
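% [Added sketch: one consistent form of the dual-vector updates, using
% the hyperparameters named under limitations; the REINFORCE gradient
% term $g_t$ is schematic and its exact form is an assumption.]
Schematically, both vectors receive reward-weighted updates within a
session, $z^{(\cdot)}_u \leftarrow z^{(\cdot)}_u + \eta_{(\cdot)}\,\hat{r}_t\, g_t$,
while between sessions only the short-term vector decays,
$z^{(S)}_u \leftarrow \lambda\, z^{(S)}_u$.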
At the same time, the global Spearman correlation between
Jaccard overlap and vector cosine similarity is modest
($\rho = 0.040$, $p = 0.093$), indicating that the
REINFORCE updates are directionally effective but noisy.
With only $60$ sessions of weak feedback per user, the
learned vectors capture coarse preference structure but
lack the resolution to distinguish fine-grained preference
differences.
Longer interaction horizons and stronger reward signals
may improve this resolution.
\paragraph{Failure modes and what personalization does not improve.}
Three recurring failure modes limit all methods:
(i) the agent sometimes becomes overly passive, asking
clarification questions instead of solving the task;
(ii) as preferences accumulate, the agent attempts to
satisfy all of them simultaneously, producing verbose or
unfocused responses; and
(iii) the user simulator occasionally disengages early
when initial responses are weak.
The user vector partially mitigates~(ii)---RAG+Vector's
timeout rate ($26.4\%$) is far lower than RAG's
($44.3\%$)---but does not address~(i) or~(iii), which
stem from the agent's generation strategy and the
simulator's tolerance, respectively.
Closing this gap likely requires tighter coupling between
the reward design and the conversation strategy
(e.g., explicitly rewarding decisive, preference-consistent
solutions rather than neutral clarification turns), as well
as memory management mechanisms that down-weight stale or
redundant preferences.
\paragraph{Limitations.}
Several limitations scope the conclusions of this study.
\emph{Scale and generalization.}
We evaluate on $60$ profiles drawn from a pool of $200$,
with $60$ sessions per profile.
While this yields $3{,}600$ sessions per method, it remains
a single benchmark with LLM-simulated users.
Extending to real users, longer interaction horizons, and
non-style preferences (e.g., topical interests, tool
choices, safety constraints) is necessary to assess
generalizability.
\emph{Reward signal.}
The current system uses a keyword-based heuristic to
classify user follow-ups into reward categories.
While fast and requiring no additional model, this
heuristic may miss subtle feedback cues.
We have implemented and validated an LLM-as-judge
alternative (Llama-3.1-8B-Instruct, $83$--$92\%$
agreement with GPT-based judges) that can serve as a
drop-in replacement for richer reward estimation in future
experiments.
\emph{Baseline strength.}
Our Reflection baseline uses prompted reflections without
GRPO fine-tuning, which may underestimate the ceiling of
reflection-based methods as reported in the original
\textsc{MultiSessionCollab} paper.
\emph{Hyperparameter sensitivity.}
Learning rates ($\eta_L$, $\eta_S$), scoring weights
($\beta_L$, $\beta_S$), decay rate ($\lambda$), and
retrieval parameters (top-$k$, reranker threshold) are
set heuristically.
Systematic hyperparameter sweeps and ablation studies
are needed to understand sensitivity.
\paragraph{Future directions.}
Three extensions are especially promising.
First, \emph{component ablations}: we are conducting
ablation experiments that disable either $z^{(S)}$
(short-term vector) or $z^{(L)}$ (long-term vector) to
quantify each component's contribution to task success and
interaction efficiency.
Second, \emph{stronger reward signals}: swapping in the
LLM-based reward model and studying how reward quality
affects user-vector learning dynamics.
Third, \emph{scaling up}: extending to all $200$ profiles
and longer session horizons to test whether a plug-in
approach---RAG plus user modeling, without per-user
fine-tuning---can sustain its advantages at larger scale
and whether the user-vector geometry becomes more
structured with more interaction data.
```