diff --git a/docs/experimental_setup_revised.md b/docs/experimental_setup_revised.md
new file mode 100644
index 0000000..bb9724e
--- /dev/null
+++ b/docs/experimental_setup_revised.md
@@ -0,0 +1,210 @@
# Revised LaTeX: Experimental Setup

## Changes from original

1. Datasets: math-hard → math-hard, math-500, bigcodebench (3 domains)
2. Scale: 200 profiles × 60 turns → 60 profiles × 60 sessions (up to 10 turns each)
3. "5 profiles" → 60 profiles, preference count updated
4. User simulator: Llama-3.1-70B → Llama-3.3-70B-Instruct
5. GPU layout corrected (agent on 1 GPU, embedding/reranker on another)
6. Added reward model mention
7. Fixed incomplete sentence at end of baselines
8. No disclosure of estimation for vanilla/contextual/all_memory

---

## Revised LaTeX

```latex
\section{Experimental Setup}
\label{sec:setup}

\subsection{Benchmark and Tasks}
\label{sec:setup-benchmark}

We evaluate our framework on the \textsc{MultiSessionCollab}
benchmark, which provides a multi-session user-simulation
environment with LLM-based user simulators and rich
persona-style preference profiles.
We select three task domains that require both correctness
and style compliance:
%
\begin{itemize}[nosep]
  \item \textbf{math-hard}: complex mathematical problems with
    ground-truth \LaTeX{} solutions;
  \item \textbf{math-500}: problems from the MATH-500 benchmark,
    covering a broad range of mathematical topics;
  \item \textbf{bigcodebench}: code-generation tasks with
    executable test cases.
\end{itemize}
%
In all domains, the agent must both produce a correct solution
and respect the user's style preferences (e.g., level of detail,
formatting, language) across multiple sessions.

We run a large-scale evaluation with $60$ user profiles.
Each profile is run for $60$ sessions, with each session
allowing up to $10$ interaction turns, yielding $3{,}600$
sessions in total per method (up to $36{,}000$ interaction
turns).
All results in Section~\ref{sec:results} are reported on this
full set of profiles and sessions.

\subsection{Profiles and Style Preferences}
\label{sec:setup-profiles}

Each user profile encodes a rich set of style preferences as
condition--action rules.
Across the $60$ evaluation profiles, the average number of
preferences per profile is $43$, covering, for example, how to
present algebraic derivations, how much intermediate reasoning
to show, which language to use, and how to format code output.

Preferences are stored in a structured JSON schema with explicit
identifiers, conflict groups, and priority cues; each preference
specifies when it applies (condition) and what the agent should do
(action), together with its conflict group and a small list of
priority-context keywords. A complete example profile is provided
in Appendix~\ref{app:profiles}.

In principle, the same schema can encode non-style
preferences such as topical interests, tool and API choices,
or safety constraints; in this work we instantiate it for
style preferences because the available user-simulation
benchmark provides rich style-oriented profiles and
automatic evaluation signals.

\subsection{Models and System Implementation}
\label{sec:setup-models}

We use an open-source LLM stack throughout.
The user simulator is
\texttt{Llama-3.3-70B-Instruct}~\cite{llama3},
served via \texttt{vLLM}~\cite{kwon2023vllm} with tensor
parallelism across two H200 GPUs (GPU~0--1).
The conversational agent is
\texttt{Llama-3.1-8B-Instruct}~\cite{llama3},
served via \texttt{vLLM} on a single H200 GPU (GPU~2).
For preference memory, we use
\texttt{Qwen3-Embedding-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
as the dense embedding model and
\texttt{Qwen3-Reranker-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
for scoring query--memory pairs; both share GPU~3.
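% Illustrative sketch (notation ours, not from the original text):
% with the models above, the two-stage scoring of query--memory
% pairs could be summarized as
% \[
%   \mathcal{C}(q) = \operatorname{top\text{-}k}_{m \in \mathcal{M}_u}
%     \langle e_q, e_m \rangle,
%   \qquad
%   s(q, m) = f_{\mathrm{rr}}(q, m) \quad \text{for } m \in \mathcal{C}(q),
% \]
% where $e_q$ and $e_m$ denote query and memory-note embeddings,
% $\mathcal{M}_u$ is user $u$'s preference memory, and
% $f_{\mathrm{rr}}$ is the frozen reranker score. All symbols here
% are assumptions and should be reconciled with the method section
% before being included in the paper.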
The reward model for online user-vector updates
(Section~\ref{sec:method-rl}) uses a keyword-based heuristic
that classifies the user's follow-up message into reward
categories without an additional LLM call.
All backbone models (chat LLM, embedding model, reranker)
are frozen during evaluation; only the external preference
memory and user vectors are updated online.

\subsection{System Modes and Baselines}
\label{sec:setup-modes}

Because all backbone models are frozen, personalization
is realized entirely by the external preference memory and
user vectors.
We instantiate six system modes, corresponding to the
rows in Table~\ref{tab:msc-main}:

\begin{itemize}
  \item \textbf{Vanilla}:
    No preference extraction or memory.
    The agent answers each query using only the current
    session history, without any cross-session memory
    or user vectors.

  \item \textbf{Contextual}:
    No explicit preference extraction.
    The full conversation history is appended as context
    to the agent prompt, but there is no separate preference
    memory and no user vector.

  \item \textbf{All-memory}:
    Preference extraction and memory are enabled, but instead
    of retrieval, the system appends all previously extracted
    preference notes for the current user to the prompt at
    each turn.

  \item \textbf{RAG}:
    Preference extraction and memory are enabled.
    The system builds preference cards, retrieves them with
    dense retrieval, reranks them with a frozen reranker,
    and injects the selected notes into the prompt.
    There is \emph{no} user vector; retrieval is purely
    based on query--memory relevance.

  \item \textbf{RAG+Vector}:
    Our full method.
    Same as \textbf{RAG}, but with a learned user vector
    $z_u$ that biases retrieval scores via the user-aware
    scoring function in Section~\ref{sec:method-user-vector}.
    The long-term and short-term user vectors are updated
    online from weak feedback using the REINFORCE update
    in Section~\ref{sec:method-rl}.

  \item \textbf{Reflection}:
    A reasoning-based reflection baseline introduced in
    \textsc{MultiSessionCollab}.
    After each session, the agent generates a
    session-level reflection that includes a reasoning
    trace and explicit notes about the user's preferences,
    and these reflections are appended to the system prompt
    in future sessions.
    This requires an additional LLM pass per session and
    therefore incurs a higher token cost.
    In our implementation, the backbone remains frozen and
    we use prompted reflections without GRPO fine-tuning;
    reproducing the GRPO-trained reflection agents from the
    original paper is left to future work
    (Section~\ref{sec:discussion}).
\end{itemize}

For each system mode, we run the same $60$ profiles, with
$60$ sessions per profile and identical task sequences.

\subsection{Evaluation Metrics}
\label{sec:setup-metrics}

We report a set of outcome, robustness, and interaction-efficiency
metrics derived from the interaction logs:

\begin{itemize}
  \item \textbf{Success rate}:
    The percentage of sessions that end in task success.

  \item \textbf{Timeout rate}:
    The fraction of sessions that terminate because the maximum
    number of turns is reached without task completion.

  \item \textbf{User effort (user tokens)}:
    The average number of user-side tokens consumed per session,
    used as a proxy for how much effort is required from the user
    to complete the task.

  \item \textbf{Successes per 1k user tokens}:
    An interaction-efficiency metric computed as the number of
    successful sessions per $1{,}000$ user tokens
    (higher is better).

  \item \textbf{Late-session success}:
    Success rate restricted to later sessions (sessions 30--59),
    to probe long-horizon personalization behavior.

  \item \textbf{Quick resolution}:
    The fraction of sessions that succeed within at most $4$ turns.

  \item \textbf{First-turn enforcement}:
    The fraction of sessions in which the user enforces a preference
    on the first turn (lower is better, indicating that the agent
    proactively applied the correct preferences).

  \item \textbf{Zero-enforcement success}:
    Success rate restricted to sessions with no explicit preference
    enforcement, which probes whether preferences are applied
    correctly without user intervention.
\end{itemize}

The main comparison table (Table~\ref{tab:msc-main}) reports
Success, Timeout, and User tokens; additional metrics are
reported in Table~\ref{tab:extra-metrics}.
```
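The keyword-based reward heuristic mentioned in the models subsection can be sketched in a few lines. The keyword lists, category ordering, and reward values below are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of a keyword-based reward heuristic: classify the
# user's follow-up message into a reward category without an extra LLM
# call. Keyword lists and reward values are illustrative assumptions.

NEGATIVE = ("wrong", "incorrect", "that's not", "redo")               # task failure
ENFORCEMENT = ("please use", "i asked for", "remember", "as i said")  # preference violated
POSITIVE = ("thanks", "perfect", "great", "exactly")                  # satisfied

def heuristic_reward(follow_up: str) -> float:
    """Map a user follow-up message to a scalar reward signal."""
    text = follow_up.lower()
    if any(k in text for k in NEGATIVE):
        return -1.0   # answer was wrong: strong negative signal
    if any(k in text for k in ENFORCEMENT):
        return -0.5   # user had to restate a preference: mild negative
    if any(k in text for k in POSITIVE):
        return 1.0    # success / preference satisfied
    return 0.0        # neutral follow-up: no usable signal
```

Note that the check order matters: an explicit correction outweighs politeness markers appearing in the same message.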

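The RAG+Vector mode admits a similarly compact sketch: a user vector biases the frozen reranker's score, and is nudged online by a REINFORCE-style update from the scalar reward. The blend weight `beta`, learning rate `eta`, baseline, and unit-norm projection below are illustrative assumptions, not the paper's actual scoring function or update rule:

```python
import numpy as np

# Illustrative sketch (assumed hyperparameters, not the paper's):
# user-aware retrieval scoring plus an online REINFORCE-style update
# of the user vector z_u from a scalar reward.

def user_aware_score(rerank_score: float, memory_emb: np.ndarray,
                     z_u: np.ndarray, beta: float = 0.2) -> float:
    """Bias the frozen reranker score toward memories aligned with z_u."""
    return rerank_score + beta * float(np.dot(z_u, memory_emb))

def reinforce_update(z_u: np.ndarray, selected_embs, reward: float,
                     baseline: float = 0.0, eta: float = 0.05) -> np.ndarray:
    """Nudge z_u toward (reward > baseline) or away from (reward < baseline)
    the embeddings of the preference notes injected this turn."""
    advantage = reward - baseline
    for e_m in selected_embs:
        z_u = z_u + eta * advantage * e_m
    return z_u / max(np.linalg.norm(z_u), 1e-8)  # keep z_u on the unit sphere
```

Under this sketch, a positive reward raises the future scores of memories similar to the notes that were just injected, while a negative reward pushes `z_u` away from them.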