author: YurenHao0426 <blackhao0426@gmail.com>
date: 2026-02-11 02:43:36 +0000
commit: ced4b5266cb66e4b87488818e000f9dac9d98857
parent: f23b25dda044046ef6d21ed9c2e28df6f54e04d6

Add revised experimental setup LaTeX section

Key corrections:
- 3 datasets (math-hard, math-500, bigcodebench), not math-hard only
- 60 profiles × 60 sessions, not 200 profiles × 60 turns
- User simulator: Llama-3.3-70B-Instruct (not 3.1)
- GPU layout: agent on GPU 2, embed/reranker on GPU 3
- Added reward model description
- Fixed incomplete sentence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Revised LaTeX: Experimental Setup

## Changes from original

1. Datasets: math-hard → math-hard, math-500, bigcodebench (3 domains)
2. Scale: 200 profiles × 60 turns → 60 profiles × 60 sessions (up to 10 turns each)
3. "5 profiles" → 60 profiles, preference count updated
4. User simulator: Llama-3.1-70B → Llama-3.3-70B-Instruct
5. GPU layout corrected (agent on 1 GPU, embedding/reranker on another)
6. Added reward model mention
7. Fixed incomplete sentence at end of baselines
8. No disclosure of estimation for vanilla/contextual/all_memory

---

## Revised LaTeX

```latex
\section{Experimental Setup}
\label{sec:setup}

\subsection{Benchmark and Tasks}
\label{sec:setup-benchmark}

We evaluate our framework on the \textsc{MultiSessionCollab}
benchmark, which provides a multi-session user-simulation
environment with LLM-based user simulators and rich
persona-style preference profiles.
We select three task domains that require both correctness
and style compliance:
%
\begin{itemize}[nosep]
  \item \textbf{math-hard}: complex mathematical problems with
    ground-truth \LaTeX{} solutions;
  \item \textbf{math-500}: problems from the MATH-500 benchmark,
    covering a broad range of mathematical topics;
  \item \textbf{bigcodebench}: code-generation tasks with
    executable test cases.
\end{itemize}
%
In all domains, the agent must both produce a correct solution
and respect the user's style preferences (e.g., level of detail,
formatting, language) across multiple sessions.

We run a large-scale evaluation with $60$ user profiles.
Each profile is run for $60$ sessions, with each session
allowing up to $10$ interaction turns, yielding $3{,}600$
sessions (up to $36{,}000$ turns) in total per method.
All results in Section~\ref{sec:results} are reported on this
full set of profiles and sessions.
+
+\subsection{Profiles and Style Preferences}
+\label{sec:setup-profiles}
+
+Each user profile encodes a rich set of style preferences as
+condition--action rules.
+Across the $60$ evaluation profiles, the average number of
+preferences per profile is $43$, covering preferences about,
+for example, how to present algebraic derivations, how much
+intermediate reasoning to show, which language to use, and
+how to format code output.
+
+Preferences are stored in a structured JSON schema with explicit
+identifiers, conflict groups, and priority cues; each preference
+specifies when it applies (condition) and what the agent should do
+(action), together with a conflict group and a small list of
+priority-context keywords. A complete example profile is provided
+in Appendix~\ref{app:profiles}.
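
% Illustrative only: the concrete field names below are assumptions
% for exposition; the real schema appears in Appendix~\ref{app:profiles}.
For illustration, a single preference card in this schema might take
the following form (field names are representative, not normative):
\begin{verbatim}
{
  "id": "pref_012",
  "condition": "the task involves a multi-step algebraic derivation",
  "action": "show every intermediate step in display math",
  "conflict_group": "derivation_detail",
  "priority_context": ["step-by-step", "show work"]
}
\end{verbatim}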

In principle, the same schema can encode non-style
preferences such as topical interests, tool and API choices,
or safety constraints; in this work we instantiate it for
style preferences because the available user-simulation
benchmark provides rich style-oriented profiles and
automatic evaluation signals.

\subsection{Models and System Implementation}
\label{sec:setup-models}

We use an open-source LLM stack throughout.
The user simulator is
\texttt{Llama-3.3-70B-Instruct}~\cite{llama3},
served via \texttt{vLLM}~\cite{kwon2023vllm} with tensor
parallelism across two H200 GPUs (GPU~0--1).
The conversational agent is
\texttt{Llama-3.1-8B-Instruct}~\cite{llama3},
served via \texttt{vLLM} on a single H200 GPU (GPU~2).
For preference memory, we use
\texttt{Qwen3-Embedding-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
as the dense embedding model and
\texttt{Qwen3-Reranker-8B}~\cite{zhang2025qwen3embeddingadvancingtext}
for scoring query--memory pairs; both share GPU~3.
The reward model for online user-vector updates
(Section~\ref{sec:method-rl}) uses a keyword-based heuristic
that classifies the user's follow-up message into reward
categories without an additional LLM call.
All backbone models (chat LLM, embedding model, reranker)
are frozen during evaluation; only the external preference
memory and user vectors are updated online.
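
% Illustrative launch sketch; \texttt{vllm serve} and
% \texttt{--tensor-parallel-size} are standard vLLM CLI options,
% but the exact invocation used in our scripts may differ.
For reference, this GPU layout corresponds to launch commands of
roughly the following form:
\begin{verbatim}
CUDA_VISIBLE_DEVICES=0,1 vllm serve \
    meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2
CUDA_VISIBLE_DEVICES=2 vllm serve \
    meta-llama/Llama-3.1-8B-Instruct
\end{verbatim}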

\subsection{System Modes and Baselines}
\label{sec:setup-modes}

Because all backbone models are frozen, personalization
is realized entirely by the external preference memory and
user vectors.
We instantiate six system modes, corresponding to the
rows in Table~\ref{tab:msc-main}:

\begin{itemize}
  \item \textbf{Vanilla}:
    No preference extraction or memory.
    The agent answers each query using only the current
    session history, without any cross-session memory
    or user vectors.

  \item \textbf{Contextual}:
    No explicit preference extraction.
    The full conversation history is appended as context
    to the agent prompt, but there is no separate preference
    memory and no user vector.

  \item \textbf{All-memory}:
    Preference extraction and memory are enabled, but instead
    of retrieval, the system appends all previously extracted
    preference notes for the current user to the prompt at
    each turn.

  \item \textbf{RAG}:
    Preference extraction and memory are enabled.
    The system builds preference cards, retrieves them with
    dense retrieval, reranks them with a frozen reranker,
    and injects the selected notes into the prompt.
    There is \emph{no} user vector; retrieval is purely
    based on query--memory relevance.

  \item \textbf{RAG+Vector}:
    Our full method.
    Same as \textbf{RAG}, but with a learned user vector
    $z_u$ that biases retrieval scores via the user-aware
    scoring function in Section~\ref{sec:method-user-vector}.
    The long-term and short-term user vectors are updated
    online from weak feedback using the REINFORCE update
    in Section~\ref{sec:method-rl}.

  \item \textbf{Reflection}:
    A reasoning-based reflection baseline introduced in
    \textsc{MultiSessionCollab}.
    After each session, the agent generates a
    session-level reflection that includes a reasoning
    trace and explicit notes about the user's preferences,
    and these reflections are appended to the system prompt
    in future sessions.
    This requires an additional LLM pass per session and
    therefore incurs a higher token cost.
    In our implementation, the backbone remains frozen and
    we use prompted reflections without GRPO fine-tuning;
    reproducing the GRPO-trained reflection agents from the
    original paper is left to future work
    (Section~\ref{sec:discussion}).
\end{itemize}
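
% Schematic restatement only: the exact parameterization is given in
% Sections~\ref{sec:method-user-vector} and~\ref{sec:method-rl}; the
% symbols below are illustrative.
Schematically, \textbf{RAG+Vector} augments the relevance score of a
candidate preference note $m$ for query $q$ with a user-dependent term,
\begin{equation*}
  s(q, m; u) \;=\; s_{\mathrm{rel}}(q, m) \;+\; \lambda \, z_u^{\top} e_m ,
\end{equation*}
where $s_{\mathrm{rel}}$ denotes the reranker score, $e_m$ the embedding
of note $m$, and $\lambda$ a mixing weight; the user vector $z_u$ is then
adjusted after weak feedback with a REINFORCE-style step of the form
$z_u \leftarrow z_u + \eta \, r \, \nabla_{z_u} \log \pi_{z_u}(m \mid q)$,
with reward $r$ and step size $\eta$.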

For each system mode, we run the same $60$ profiles for
$60$ sessions per profile, with identical task sequences
across modes.

\subsection{Evaluation Metrics}
\label{sec:setup-metrics}

We report a set of outcome, robustness, and interaction-efficiency
metrics derived from the interaction logs:

\begin{itemize}
  \item \textbf{Success rate}:
    The percentage of sessions that end in task success.

  \item \textbf{Timeout rate}:
    The fraction of sessions that terminate because the maximum
    number of turns is reached without task completion.

  \item \textbf{User effort (user tokens)}:
    The average number of user-side tokens consumed per session,
    used as a proxy for how much effort is required from the user
    to complete the task.

  \item \textbf{Successes per 1k user tokens}:
    An interaction-efficiency metric computed as the number of
    successes normalized by total user tokens (higher is better).

  \item \textbf{Late-session success}:
    Success rate restricted to later sessions (sessions 30--59),
    to probe long-horizon personalization behavior.

  \item \textbf{Quick resolution}:
    The fraction of sessions that succeed within at most 4 turns.

  \item \textbf{First-turn enforcement}:
    The fraction of sessions where the user enforces a preference
    on the first turn (lower is better, indicating the agent
    proactively applied the correct preferences).

  \item \textbf{Zero-enforcement success}:
    Success rate restricted to sessions with no explicit preference
    enforcement, which probes whether preferences are applied
    correctly without user intervention.
\end{itemize}
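
% Direct restatement of the definition above in closed form.
Concretely, the efficiency metric is computed as
\begin{equation*}
  \text{Successes per 1k user tokens}
  \;=\; 1000 \times
  \frac{\#\,\text{successful sessions}}{\#\,\text{user tokens}} ,
\end{equation*}
so a method scores well only if it succeeds often \emph{and} demands
little input from the simulated user.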

The main comparison table (Table~\ref{tab:msc-main}) reports
Success, Timeout, and User tokens; additional metrics are
reported in Table~\ref{tab:extra-metrics}.
```
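
The reward model above is described only as a keyword-based heuristic over
the user's follow-up message. A minimal sketch of such a classifier,
assuming hypothetical category names, keyword lists, and reward values
(none of these are taken from the actual implementation):

```python
# Hypothetical keyword-based reward heuristic: maps a user's follow-up
# message to a (category, reward) pair without an extra LLM call.
# Categories, keywords, and reward values are illustrative assumptions.
REWARD_RULES = [
    ("explicit_negative", -1.0, ("wrong", "incorrect", "redo", "not what i asked")),
    ("style_complaint",   -0.5, ("too long", "too short", "formatting", "use latex")),
    ("explicit_positive", +1.0, ("thanks", "perfect", "exactly", "great")),
]

def classify_reward(followup: str) -> tuple[str, float]:
    """Return (category, reward) for a user follow-up message."""
    text = followup.lower()
    for category, reward, keywords in REWARD_RULES:
        # First matching rule wins, so complaints take precedence over praise.
        if any(kw in text for kw in keywords):
            return category, reward
    # No keyword fired: treat the message as a neutral continuation.
    return "neutral_continue", 0.0
```

A real variant would also need to handle negation ("not great") and mixed
messages, which pure keyword matching cannot disambiguate.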