# Revised LaTeX: Experimental Setup

## Changes from original

1. Datasets: math-hard → math-hard, math-500, bigcodebench (3 domains)
2. Scale: 200 profiles × 60 turns → 60 profiles × 60 sessions (up to 10 turns each)
3. "5 profiles" → 60 profiles, preference count updated
4. User simulator: Llama-3.1-70B → Llama-3.3-70B-Instruct
5. GPU layout corrected (agent on 1 GPU, embedding/reranker on another)
6. Added reward model mention
7. Fixed incomplete sentence at end of baselines
8. No disclosure of estimation for vanilla/contextual/all_memory

---

## Revised LaTeX

```latex
\section{Experimental Setup}
\label{sec:setup}

\subsection{Benchmark and Tasks}
\label{sec:setup-benchmark}

We evaluate our framework on the \textsc{MultiSessionCollab} benchmark, which provides a multi-session user-simulation environment with LLM-based user simulators and rich persona-style preference profiles. We select three task domains that require both correctness and style compliance:
%
\begin{itemize}[nosep]
\item \textbf{math-hard}: complex mathematical problems with ground-truth \LaTeX{} solutions;
\item \textbf{math-500}: problems from the MATH-500 benchmark, covering a broad range of mathematical topics;
\item \textbf{bigcodebench}: code-generation tasks with executable test cases.
\end{itemize}
%
In all domains, the agent must both produce a correct solution and respect the user's style preferences (e.g., level of detail, formatting, language) across multiple sessions.

We run a large-scale evaluation with $60$ user profiles. Each profile is run for $60$ sessions, with each session allowing up to $10$ interaction turns, yielding $3{,}600$ sessions (up to $36{,}000$ interaction turns) in total per method. All results in Section~\ref{sec:results} are reported on this full set of profiles and sessions.

\subsection{Profiles and Style Preferences}
\label{sec:setup-profiles}

Each user profile encodes a rich set of style preferences as condition--action rules.
Across the $60$ evaluation profiles, the average number of preferences per profile is $43$, covering, for example, how to present algebraic derivations, how much intermediate reasoning to show, which language to use, and how to format code output. Preferences are stored in a structured JSON schema with explicit identifiers, conflict groups, and priority cues; each preference specifies when it applies (condition) and what the agent should do (action), together with a conflict group and a small list of priority-context keywords. A complete example profile is provided in Appendix~\ref{app:profiles}.

In principle, the same schema can encode non-style preferences such as topical interests, tool and API choices, or safety constraints; in this work we instantiate it for style preferences because the available user-simulation benchmark provides rich style-oriented profiles and automatic evaluation signals.

\subsection{Models and System Implementation}
\label{sec:setup-models}

We use an open-source LLM stack throughout. The user simulator is \texttt{Llama-3.3-70B-Instruct}~\cite{llama3}, served via \texttt{vLLM}~\cite{kwon2023vllm} with tensor parallelism across two H200 GPUs (GPU~0--1). The conversational agent is \texttt{Llama-3.1-8B-Instruct}~\cite{llama3}, served via \texttt{vLLM} on a single H200 GPU (GPU~2). For preference memory, we use \texttt{Qwen3-Embedding-8B}~\cite{zhang2025qwen3embeddingadvancingtext} as the dense embedding model and \texttt{Qwen3-Reranker-8B}~\cite{zhang2025qwen3embeddingadvancingtext} for scoring query--memory pairs; both share GPU~3. The reward model for online user-vector updates (Section~\ref{sec:method-rl}) is a keyword-based heuristic that classifies the user's follow-up message into reward categories without an additional LLM call.

All backbone models (chat LLM, embedding model, reranker) are frozen during evaluation; only the external preference memory and user vectors are updated online.
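% Illustrative note (an assumption, kept commented out so it does not render):
% the user-aware scoring function defined in Section~\ref{sec:method-user-vector}
% could, for example, take an additive form
%   s(q, m; u) = s_rerank(q, m) + \lambda <z_u, e_m>,
% where e_m is the memory embedding of note m and \lambda weights the
% user-vector bias; the exact form used in our experiments is the one given
% in that section.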
\subsection{System Modes and Baselines}
\label{sec:setup-modes}

Because all backbone models are frozen, personalization is realized entirely by the external preference memory and user vectors. We instantiate six system modes, corresponding to the rows in Table~\ref{tab:msc-main}:

\begin{itemize}
\item \textbf{Vanilla}: No preference extraction or memory. The agent answers each query using only the current session history, without any cross-session memory or user vectors.
\item \textbf{Contextual}: No explicit preference extraction. The full conversation history is appended as context to the agent prompt, but there is no separate preference memory and no user vector.
\item \textbf{All-memory}: Preference extraction and memory are enabled, but instead of retrieval, the system appends all previously extracted preference notes for the current user to the prompt at each turn.
\item \textbf{RAG}: Preference extraction and memory are enabled. The system builds preference cards, retrieves them with dense retrieval, reranks them with a frozen reranker, and injects the selected notes into the prompt. There is \emph{no} user vector; retrieval is based purely on query--memory relevance.
\item \textbf{RAG+Vector}: Our full method. Same as \textbf{RAG}, but with a learned user vector $z_u$ that biases retrieval scores via the user-aware scoring function in Section~\ref{sec:method-user-vector}. The long-term and short-term user vectors are updated online from weak feedback using the REINFORCE update in Section~\ref{sec:method-rl}.
\item \textbf{Reflection}: A reasoning-based reflection baseline introduced in \textsc{MultiSessionCollab}. After each session, the agent generates a session-level reflection that includes a reasoning trace and explicit notes about the user's preferences, and these reflections are appended to the system prompt in future sessions. This requires an additional LLM pass per session and therefore incurs a higher token cost.
In our implementation, the backbone remains frozen and we use prompted reflections without GRPO fine-tuning; reproducing the GRPO-trained reflection agents from the original paper is left to future work (Section~\ref{sec:discussion}).
\end{itemize}

For each system mode, we run the same $60$ profiles, with $60$ sessions per profile and identical task sequences.

\subsection{Evaluation Metrics}
\label{sec:setup-metrics}

We report a set of outcome, robustness, and interaction-efficiency metrics derived from the interaction logs:

\begin{itemize}
\item \textbf{Success rate}: The percentage of sessions that end in task success.
\item \textbf{Timeout rate}: The fraction of sessions that terminate because the maximum number of turns is reached without task completion.
\item \textbf{User effort (user tokens)}: The average number of user-side tokens consumed per session, a proxy for how much effort the user must expend to complete the task.
\item \textbf{Successes per 1k user tokens}: An interaction-efficiency metric computed as the number of successes normalized by total user tokens (higher is better).
\item \textbf{Late-session success}: Success rate restricted to later sessions (sessions 30--59), to probe long-horizon personalization behavior.
\item \textbf{Quick resolution}: The fraction of sessions that succeed within at most $4$ turns.
\item \textbf{First-turn enforcement}: The fraction of sessions in which the user enforces a preference on the first turn (lower is better, indicating that the agent proactively applied the correct preferences).
\item \textbf{Zero-enforcement success}: Success rate restricted to sessions with no explicit preference enforcement, which probes whether preferences are applied correctly without user intervention.
\end{itemize}

The main comparison table (Table~\ref{tab:msc-main}) reports Success, Timeout, and User tokens; additional metrics are reported in Table~\ref{tab:extra-metrics}.
```
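For concreteness, a single preference entry matching the schema described in the profiles subsection (identifier, condition, action, conflict group, priority-context keywords) might look like the sketch below. All field names and values here are hypothetical illustrations, not the benchmark's actual schema.

```python
import json

# Hypothetical preference card; field names and values are illustrative only.
preference_card = {
    "id": "pref-042",
    "condition": "the task involves presenting an algebraic derivation",
    "action": "show every intermediate step in display math, one equation per line",
    "conflict_group": "derivation-verbosity",
    "priority_context": ["exam prep", "step-by-step", "show work"],
}

# Cards of this shape can be serialized for storage in the preference memory.
serialized = json.dumps(preference_card, indent=2)
print(serialized)
```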
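The keyword-based reward heuristic mentioned in the implementation subsection could be sketched as follows. The keyword lists, category names, and reward values below are assumptions for illustration, not the system's actual configuration; the point is only that the user's follow-up message is mapped to a reward category without an extra LLM call.

```python
# Hypothetical keyword lists; the real system's lists are not shown in this excerpt.
POSITIVE_KEYWORDS = ("thanks", "perfect", "great", "exactly")
ENFORCEMENT_KEYWORDS = ("please use", "i asked for", "remember", "as i said")
NEGATIVE_KEYWORDS = ("wrong", "incorrect", "that's not", "redo")

def classify_feedback(message: str) -> str:
    """Map a user follow-up message to a coarse reward category by keyword match."""
    text = message.lower()
    if any(k in text for k in NEGATIVE_KEYWORDS):
        return "negative"               # answer rejected outright
    if any(k in text for k in ENFORCEMENT_KEYWORDS):
        return "preference_violation"   # user re-states a preference
    if any(k in text for k in POSITIVE_KEYWORDS):
        return "positive"
    return "neutral"

# Illustrative scalar rewards per category, usable in a REINFORCE-style update.
REWARD = {"positive": 1.0, "neutral": 0.0, "preference_violation": -0.5, "negative": -1.0}
```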
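The user-aware scoring used by the RAG+Vector mode is defined in the paper's method section, which is not part of this excerpt. The sketch below assumes one plausible additive form, cosine relevance between query and memory plus a weighted dot product with the user vector, purely to illustrate how a user vector can bias retrieval; `lam` and the function shape are assumptions.

```python
import numpy as np

def user_aware_scores(query_emb, memory_embs, z_u, lam=0.5):
    """Score memory entries for a query with an additive user-vector bias.

    Assumed form for illustration: cosine(query, memory) + lam * <z_u, memory>.
    The paper's actual scoring function is the one in its method section.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    relevance = normalize(memory_embs) @ normalize(query_emb)  # cosine per memory
    user_bias = memory_embs @ z_u                              # user-preference bias
    return relevance + lam * user_bias

rng = np.random.default_rng(0)
q = rng.normal(size=8)            # toy query embedding
M = rng.normal(size=(5, 8))       # toy memory embeddings
z = np.zeros(8)                   # zero user vector: scores reduce to pure relevance
scores = user_aware_scores(q, M, z)
top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 best-scoring memories
```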
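The evaluation metrics listed in the section can be computed directly from per-session logs. The sketch below assumes a hypothetical log schema (`success`, `turns`, `user_tokens`, `session_index`, `enforcements`), not the benchmark's actual one, and mirrors the metric definitions above.

```python
def compute_metrics(sessions, max_turns=10):
    """Compute the setup's outcome and efficiency metrics from session logs."""
    n = len(sessions)
    successes = [s for s in sessions if s["success"]]
    total_user_tokens = sum(s["user_tokens"] for s in sessions)
    late = [s for s in sessions if 30 <= s["session_index"] <= 59]
    zero_enf = [s for s in sessions if s["enforcements"] == 0]
    return {
        "success_rate": len(successes) / n,
        "timeout_rate": sum(1 for s in sessions
                            if not s["success"] and s["turns"] >= max_turns) / n,
        "user_tokens_per_session": total_user_tokens / n,
        "successes_per_1k_user_tokens": 1000 * len(successes) / total_user_tokens,
        "quick_resolution": sum(1 for s in successes if s["turns"] <= 4) / n,
        "late_session_success": (sum(1 for s in late if s["success"]) / len(late))
                                if late else None,
        "zero_enforcement_success": (sum(1 for s in zero_enf if s["success"]) / len(zero_enf))
                                    if zero_enf else None,
    }

# Two toy sessions, only to exercise the definitions.
logs = [
    {"success": True, "turns": 3, "user_tokens": 100, "session_index": 0, "enforcements": 0},
    {"success": False, "turns": 10, "user_tokens": 300, "session_index": 30, "enforcements": 1},
]
metrics = compute_metrics(logs)
```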