author: YurenHao0426 <blackhao0426@gmail.com>
date: 2026-02-11 02:43:36 +0000
commit: ced4b5266cb66e4b87488818e000f9dac9d98857
parent: f23b25dda044046ef6d21ed9c2e28df6f54e04d6

Add revised experimental setup LaTeX section

Key corrections:
- 3 datasets (math-hard, math-500, bigcodebench), not math-hard only
- 60 profiles × 60 sessions, not 200 profiles × 60 turns
- User simulator: Llama-3.3-70B-Instruct (not 3.1)
- GPU layout: agent on GPU 2, embed/reranker on GPU 3
- Added reward model description
- Fixed incomplete sentence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Revised LaTeX: Experimental Setup

## Changes from original

1. Datasets: math-hard → math-hard, math-500, bigcodebench (3 domains)
2. Scale: 200 profiles × 60 turns → 60 profiles × 60 sessions (up to 10 turns each)
3. "5 profiles" → 60 profiles, preference count updated
4. User simulator: Llama-3.1-70B → Llama-3.3-70B-Instruct
5. GPU layout corrected (agent on 1 GPU, embedding/reranker on another)
6. Added reward model mention
7. Fixed incomplete sentence at end of baselines
8. No disclosure of estimation for vanilla/contextual/all_memory

---

## Revised LaTeX

```latex
\section{Experimental Setup}
\label{sec:setup}

\subsection{Benchmark and Tasks}
\label{sec:setup-benchmark}

We evaluate our framework on the \textsc{MultiSessionCollab}
benchmark, which provides a multi-session user-simulation
environment with LLM-based user simulators and rich
persona-style preference profiles.
We select three task domains that require both correctness
and style compliance:
%
\begin{itemize}[nosep]
  \item \textbf{math-hard}: complex mathematical problems with
    ground-truth \LaTeX{} solutions;
  \item \textbf{math-500}: problems from the MATH-500 benchmark,
    covering a broad range of mathematical topics;
  \item \textbf{bigcodebench}: code-generation tasks with
    executable test cases.
\end{itemize}
%
In all domains, the agent must both produce a correct solution
and respect the user's style preferences (e.g., level of detail,
formatting, language) across multiple sessions.

We run a large-scale evaluation with $60$ user profiles.
Each profile is run for $60$ sessions, with each session
allowing up to $10$ interaction turns, yielding $3{,}600$
sessions (up to $36{,}000$ turns) in total per method.
All results in Section~\ref{sec:results} are reported on this
full set of profiles and sessions.
+
+\subsection{Profiles and Style Preferences}
+\label{sec:setup-profiles}
+
+Each user profile encodes a rich set of style preferences as
+condition--action rules.
+Across the $60$ evaluation profiles, the average number of
+preferences per profile is $43$, covering preferences about,
+for example, how to present algebraic derivations, how much
+intermediate reasoning to show, which language to use, and
+how to format code output.
+
+Preferences are stored in a structured JSON schema with explicit
+identifiers, conflict groups, and priority cues; each preference
+specifies when it applies (condition) and what the agent should do
+(action), together with a conflict group and a small list of
+priority-context keywords. A complete example profile is provided
+in Appendix~\ref{app:profiles}.
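
% Illustrative only: the concrete field names below are assumptions
% for exposition; the real schema appears in Appendix~\ref{app:profiles}.
For illustration, a single preference card in this schema might take
the following form (field names are representative, not normative):
\begin{verbatim}
{
  "id": "pref_012",
  "condition": "the task involves a multi-step algebraic derivation",
  "action": "show every intermediate step in display math",
  "conflict_group": "derivation_detail",
  "priority_context": ["step-by-step", "show work"]
}
\end{verbatim}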

In principle, the same schema can encode non-style
preferences such as topical interests, tool and API choices,
or safety constraints; in this work we instantiate it for
style preferences because the available user-simulation
benchmark provides rich style-oriented profiles and
automatic evaluation signals.

\subsection{Models and System Implementation}
\label{sec:setup-models}

We use an open-source LLM stack throughout.
The user simulator is
\texttt{Llama-3.3-70B-Instruct}~\cite{llama3},
served via \texttt{vLLM}~\cite{kwon2023vllm} with tensor
parallelism across two H200 GPUs (GPU~0--1).
The conversational agent is
\texttt{Llama-3.1-8B-Instruct}~\cite{llama3},
served via \texttt{vLLM} on a single H200 GPU (GPU~2).
For preference memory, we use
\texttt{Qwen3-Embedding-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
as the dense embedding model and
\texttt{Qwen3-Reranker-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
for scoring query--memory pairs; both share GPU~3.
The reward model for online user-vector updates
(Section~\ref{sec:method-rl}) uses a keyword-based heuristic
that classifies the user's follow-up message into reward
categories without an additional LLM call.
All backbone models (chat LLM, embedding model, reranker)
are frozen during evaluation; only the external preference
memory and user vectors are updated online.
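
% Illustrative launch sketch; \texttt{vllm serve} and
% \texttt{--tensor-parallel-size} are standard vLLM CLI options,
% but the exact invocation used in our scripts may differ.
For reference, this GPU layout corresponds to launch commands of
roughly the following form:
\begin{verbatim}
CUDA_VISIBLE_DEVICES=0,1 vllm serve \
    meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2
CUDA_VISIBLE_DEVICES=2 vllm serve \
    meta-llama/Llama-3.1-8B-Instruct
\end{verbatim}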

\subsection{System Modes and Baselines}
\label{sec:setup-modes}

Because all backbone models are frozen, personalization
is realized entirely by the external preference memory and
user vectors.
We instantiate six system modes, corresponding to the
rows in Table~\ref{tab:msc-main}:

\begin{itemize}
  \item \textbf{Vanilla}:
    No preference extraction or memory.
    The agent answers each query using only the current
    session history, without any cross-session memory
    or user vectors.

  \item \textbf{Contextual}:
    No explicit preference extraction.
    The full conversation history is appended as context
    to the agent prompt, but there is no separate preference
    memory and no user vector.

  \item \textbf{All-memory}:
    Preference extraction and memory are enabled, but instead
    of retrieval, the system appends all previously extracted
    preference notes for the current user to the prompt at
    each turn.

  \item \textbf{RAG}:
    Preference extraction and memory are enabled.
    The system builds preference cards, retrieves them with
    dense retrieval, reranks them with a frozen reranker,
    and injects the selected notes into the prompt.
    There is \emph{no} user vector; retrieval is purely
    based on query--memory relevance.

  \item \textbf{RAG+Vector}:
    Our full method.
    Same as \textbf{RAG}, but with a learned user vector
    $z_u$ that biases retrieval scores via the user-aware
    scoring function in Section~\ref{sec:method-user-vector}.
    The long-term and short-term user vectors are updated
    online from weak feedback using the REINFORCE update
    in Section~\ref{sec:method-rl}.

  \item \textbf{Reflection}:
    A reasoning-based reflection baseline introduced in
    \textsc{MultiSessionCollab}.
    After each session, the agent generates a
    session-level reflection that includes a reasoning
    trace and explicit notes about the user's preferences,
    and these reflections are appended to the system prompt
    in future sessions.
    This requires an additional LLM pass per session and
    therefore incurs a higher token cost.
    In our implementation, the backbone remains frozen and
    we use prompted reflections without GRPO fine-tuning;
    reproducing the GRPO-trained reflection agents from the
    original paper is left to future work
    (Section~\ref{sec:discussion}).
\end{itemize}
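
% Schematic restatement only: the exact parameterization is given in
% Sections~\ref{sec:method-user-vector} and~\ref{sec:method-rl}; the
% symbols below are illustrative.
Schematically, \textbf{RAG+Vector} augments the relevance score of a
candidate preference note $m$ for query $q$ with a user-dependent term,
\begin{equation*}
  s(q, m; u) \;=\; s_{\mathrm{rel}}(q, m) \;+\; \lambda \, z_u^{\top} e_m ,
\end{equation*}
where $s_{\mathrm{rel}}$ denotes the reranker score, $e_m$ the embedding
of note $m$, and $\lambda$ a mixing weight; the user vector $z_u$ is then
adjusted after weak feedback with a REINFORCE-style step of the form
$z_u \leftarrow z_u + \eta \, r \, \nabla_{z_u} \log \pi_{z_u}(m \mid q)$,
with reward $r$ and step size $\eta$.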

For each system mode, we run the same $60$ profiles for
$60$ sessions per profile, with identical task sequences
across modes.

\subsection{Evaluation Metrics}
\label{sec:setup-metrics}

We report a set of outcome, robustness, and interaction-efficiency
metrics derived from the interaction logs:

\begin{itemize}
  \item \textbf{Success rate}:
    The percentage of sessions that end in task success.

  \item \textbf{Timeout rate}:
    The fraction of sessions that terminate because the maximum
    number of turns is reached without task completion.

  \item \textbf{User effort (user tokens)}:
    The average number of user-side tokens consumed per session,
    used as a proxy for how much effort is required from the user
    to complete the task.

  \item \textbf{Successes per 1k user tokens}:
    An interaction-efficiency metric computed as the number of
    successes normalized by total user tokens (higher is better).

  \item \textbf{Late-session success}:
    Success rate restricted to later sessions (sessions 30--59),
    to probe long-horizon personalization behavior.

  \item \textbf{Quick resolution}:
    The fraction of sessions that succeed within at most 4 turns.

  \item \textbf{First-turn enforcement}:
    The fraction of sessions where the user enforces a preference
    on the first turn (lower is better, indicating the agent
    proactively applied the correct preferences).

  \item \textbf{Zero-enforcement success}:
    Success rate restricted to sessions with no explicit preference
    enforcement, which probes whether preferences are applied
    correctly without user intervention.
\end{itemize}
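
% Direct restatement of the definition above in closed form.
Concretely, the efficiency metric is computed as
\begin{equation*}
  \text{Successes per 1k user tokens}
  \;=\; 1000 \times
  \frac{\#\,\text{successful sessions}}{\#\,\text{user tokens}} ,
\end{equation*}
so a method scores well only if it succeeds often \emph{and} demands
little input from the simulated user.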

The main comparison table (Table~\ref{tab:msc-main}) reports
Success, Timeout, and User tokens; additional metrics are
reported in Table~\ref{tab:extra-metrics}.
```
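
The reward model above is described only as a keyword-based heuristic over
the user's follow-up message. A minimal sketch of such a classifier,
assuming hypothetical category names, keyword lists, and reward values
(none of these are taken from the actual implementation):

```python
# Hypothetical keyword-based reward heuristic: maps a user's follow-up
# message to a (category, reward) pair without an extra LLM call.
# Categories, keywords, and reward values are illustrative assumptions.
REWARD_RULES = [
    ("explicit_negative", -1.0, ("wrong", "incorrect", "redo", "not what i asked")),
    ("style_complaint",   -0.5, ("too long", "too short", "formatting", "use latex")),
    ("explicit_positive", +1.0, ("thanks", "perfect", "exactly", "great")),
]

def classify_reward(followup: str) -> tuple[str, float]:
    """Return (category, reward) for a user follow-up message."""
    text = followup.lower()
    for category, reward, keywords in REWARD_RULES:
        # First matching rule wins, so complaints take precedence over praise.
        if any(kw in text for kw in keywords):
            return category, reward
    # No keyword fired: treat the message as a neutral continuation.
    return "neutral_continue", 0.0
```

A real variant would also need to handle negation ("not great") and mixed
messages, which pure keyword matching cannot disambiguate.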