# Revised LaTeX: Experimental Setup

## Changes from original

1. Datasets: math-hard → math-hard, math-500, bigcodebench (3 domains)
2. Scale: 200 profiles × 60 turns → 60 profiles × 60 sessions (up to 10 turns each)
3. "5 profiles" → 60 profiles, preference count updated
4. User simulator: Llama-3.1-70B → Llama-3.3-70B-Instruct
5. GPU layout corrected (agent on 1 GPU, embedding/reranker on another)
6. Added reward model mention
7. Fixed incomplete sentence at end of baselines
8. No disclosure of estimation for vanilla/contextual/all_memory

---

## Revised LaTeX

```latex
\section{Experimental Setup}
\label{sec:setup}

\subsection{Benchmark and Tasks}
\label{sec:setup-benchmark}

We evaluate our framework on the \textsc{MultiSessionCollab}
benchmark, which provides a multi-session user-simulation
environment with LLM-based user simulators and rich
persona-style preference profiles.
We select three task domains that require both correctness
and style compliance:
%
\begin{itemize}[nosep]
  \item \textbf{math-hard}: complex mathematical problems with
        ground-truth \LaTeX{} solutions;
  \item \textbf{math-500}: problems from the MATH-500 benchmark,
        covering a broad range of mathematical topics;
  \item \textbf{bigcodebench}: code-generation tasks with
        executable test cases.
\end{itemize}
%
In all domains, the agent must both produce a correct solution
and respect the user's style preferences (e.g., level of detail,
formatting, language) across multiple sessions.

We run a large-scale evaluation with $60$ user profiles.
Each profile is run for $60$ sessions, with each session
allowing up to $10$ interaction turns, yielding $3{,}600$
sessions (up to $36{,}000$ turns) in total per method.
All results in Section~\ref{sec:results} are reported on this
full set of profiles and sessions.

\subsection{Profiles and Style Preferences}
\label{sec:setup-profiles}

Each user profile encodes a rich set of style preferences as
condition--action rules.
Across the $60$ evaluation profiles, each profile contains
$43$ preferences on average, covering, for example, how to
present algebraic derivations, how much intermediate reasoning
to show, which language to respond in, and how to format
code output.

Preferences are stored in a structured JSON schema with explicit
identifiers, conflict groups, and priority cues; each preference
specifies when it applies (condition) and what the agent should do
(action), together with a conflict group and a small list of
priority-context keywords. A complete example profile is provided
in Appendix~\ref{app:profiles}.
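
% Illustrative (hypothetical) preference card. The field names below
% are assumptions chosen to match the description above, not the exact
% schema used in our implementation; see Appendix~\ref{app:profiles}
% for a real profile.
\begin{verbatim}
{
  "id": "pref-012",
  "condition": "user asks for a step-by-step algebra derivation",
  "action": "show each step on its own line; box the final answer",
  "conflict_group": "derivation-verbosity",
  "priority_context": ["exam preparation", "homework"]
}
\end{verbatim}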

In principle, the same schema can encode non-style
preferences such as topical interests, tool and API choices,
or safety constraints; in this work we instantiate it for
style preferences because the available user-simulation
benchmark provides rich style-oriented profiles and
automatic evaluation signals.

\subsection{Models and System Implementation}
\label{sec:setup-models}

We use an open-source LLM stack throughout.
The user simulator is
\texttt{Llama-3.3-70B-Instruct}~\cite{llama3},
served via \texttt{vLLM}~\cite{kwon2023vllm} with tensor
parallelism across two H200 GPUs (GPU~0--1).
The conversational agent is
\texttt{Llama-3.1-8B-Instruct}~\cite{llama3},
served via \texttt{vLLM} on a single H200 GPU (GPU~2).
For preference memory, we use
\texttt{Qwen3-Embedding-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
as the dense embedding model and
\texttt{Qwen3-Reranker-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
for scoring query--memory pairs; both share GPU~3.
The reward model for online user-vector updates
(Section~\ref{sec:method-rl}) uses a keyword-based heuristic
that classifies the user's follow-up message into reward
categories without an additional LLM call.
All backbone models (chat LLM, embedding model, reranker)
are frozen during evaluation; only the external preference
memory and user vectors are updated online.

\subsection{System Modes and Baselines}
\label{sec:setup-modes}

Because all backbone models are frozen, personalization
is realized entirely by the external preference memory and
user vectors.
We instantiate six system modes, corresponding to the
rows in Table~\ref{tab:msc-main}:

\begin{itemize}
  \item \textbf{Vanilla}:
    No preference extraction or memory.
    The agent answers each query using only the current
    session history, without any cross-session memory
    or user vectors.

  \item \textbf{Contextual}:
    No explicit preference extraction.
    The full conversation history is appended as context
    to the agent prompt, but there is no separate preference
    memory and no user vector.

  \item \textbf{All-memory}:
    Preference extraction and memory are enabled, but instead
    of retrieval, the system appends all previously extracted
    preference notes for the current user to the prompt at
    each turn.

  \item \textbf{RAG}:
    Preference extraction and memory are enabled.
    The system builds preference cards, retrieves them with
    dense retrieval, reranks them with a frozen reranker,
    and injects the selected notes into the prompt.
    There is \emph{no} user vector; retrieval is purely
    based on query--memory relevance.

  \item \textbf{RAG+Vector}:
    Our full method.
    Same as \textbf{RAG}, but with a learned user vector
    $z_u$ that biases retrieval scores via the user-aware
    scoring function in Section~\ref{sec:method-user-vector}.
    The long-term and short-term user vectors are updated
    online from weak feedback using the REINFORCE update
    in Section~\ref{sec:method-rl}.

  \item \textbf{Reflection}:
    A reasoning-based reflection baseline introduced in
    \textsc{MultiSessionCollab}.
    After each session, the agent generates a
    session-level reflection that includes a reasoning
    trace and explicit notes about the user's preferences,
    and these reflections are appended to the system prompt
    in future sessions.
    This requires an additional LLM pass per session and
    therefore incurs a higher token cost.
    In our implementation, the backbone remains frozen and
    we use prompted reflections without GRPO fine-tuning;
    reproducing the GRPO-trained reflection agents from the
    original paper is left to future work
    (Section~\ref{sec:discussion}).
\end{itemize}
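
% Sketch (kept as a comment because the form is an assumption): one
% plausible instantiation of the user-aware scoring in RAG+Vector is
%   s(q, m; u) = s_rerank(q, m) + \beta * <z_u, e_m>,
% where e_m is the memory embedding and \beta a bias weight; the actual
% scoring function is defined in Section~\ref{sec:method-user-vector}.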

For each system mode, we run the same $60$ profiles for
$60$ sessions each, with identical task sequences.

\subsection{Evaluation Metrics}
\label{sec:setup-metrics}

We report a set of outcome, robustness, and interaction-efficiency
metrics derived from the interaction logs:

\begin{itemize}
  \item \textbf{Success rate}:
    The percentage of sessions that end in task success.

  \item \textbf{Timeout rate}:
    The fraction of sessions that terminate because the maximum
    number of turns is reached without task completion.

  \item \textbf{User effort (user tokens)}:
    The average number of user-side tokens consumed per session,
    used as a proxy for how much effort is required from the user
    to complete the task.

  \item \textbf{Successes per 1k user tokens}:
    An interaction-efficiency metric: the number of successful
    sessions per $1{,}000$ user tokens (higher is better).

  \item \textbf{Late-session success}:
    Success rate restricted to later sessions (sessions 30--59),
    to probe long-horizon personalization behavior.

  \item \textbf{Quick resolution}:
    The fraction of sessions that succeed within at most 4 turns.

  \item \textbf{First-turn enforcement}:
    The fraction of sessions where the user enforces a preference
    on the first turn (lower is better, indicating the agent
    proactively applied the correct preferences).

  \item \textbf{Zero-enforcement success}:
    Success rate restricted to sessions with no explicit preference
    enforcement, which probes whether preferences are applied
    correctly without user intervention.
\end{itemize}
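
% The interaction-efficiency metric restated as a formula (this follows
% directly from its definition above):
\[
  \text{Successes per 1k user tokens}
  \;=\; 1000 \times
  \frac{\#\,\text{successful sessions}}
       {\sum_{\text{sessions}} \text{user tokens}} .
\]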

The main comparison table (Table~\ref{tab:msc-main}) reports
Success, Timeout, and User tokens; additional metrics are
reported in Table~\ref{tab:extra-metrics}.
```
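
The keyword-based reward heuristic mentioned in the Models subsection can be sketched as follows. The cue lists, reward values, and function name are illustrative assumptions, not the exact ones used in the system:

```python
# Sketch of a keyword-based reward heuristic for online user-vector
# updates. Cue lists and reward magnitudes are illustrative assumptions.

POSITIVE_CUES = ("thanks", "perfect", "great", "exactly")
NEGATIVE_CUES = ("no,", "wrong", "actually", "instead", "don't")

def classify_feedback(message: str) -> float:
    """Map a user's follow-up message to a scalar reward.

    Returns -1.0 for an apparent correction, +1.0 for apparent
    approval, and 0.0 when neither cue set matches.
    """
    text = message.lower()
    # Check corrections first so mixed signals count as negative.
    if any(cue in text for cue in NEGATIVE_CUES):
        return -1.0
    if any(cue in text for cue in POSITIVE_CUES):
        return 1.0
    return 0.0
```

Because the classifier is a pure string match, it adds no extra LLM call per turn, which is the property the section highlights.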