# Revised LaTeX: Reward Modeling and Online User Adaptation

Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.

## Changes from original

1. **Input**: Removed $A_t$ from the reward model input — the code only uses $(q_t, a_t, q_{t+1})$
2. **Classifier**: Replaced the abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with a single 7-label LLM-as-judge classifier plus a fixed reward mapping
3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
5. **REINFORCE section**: Kept as-is (already correct)

---

## Revised LaTeX

```latex
\subsection{Reward Modeling and Online User Adaptation}
\label{sec:method-rl}

We do not assume a specific type of feedback such as thumbs-up/down. Instead,
we treat all available interaction signals as being mapped into a \emph{scalar
reward} that measures how helpful the system was for the user at a given
turn~\cite{williams1992simple}. In our framework, this scalar reward is the
only interface between feedback and the user-vector update: different
applications can plug in different feedback sources as long as they can be
converted into a real-valued reward (e.g., explicit ratings, binary
``good/bad'' labels, dwell time, click-through, or task-completion signals).
\textbf{In this report, we instantiate this general interface with a specific
form of weak feedback available in our MultiSessionCollab case study.}
Concretely, we use an automated LLM judge that observes the user's follow-up
message $q_{t+1}$ to derive a scalar reward for the previous assistant
response $a_t$. If the user continues the task smoothly without restating
preferences, the response is treated as more rewarding; if they restate
preferences or express dissatisfaction, the response is treated as less
rewarding.
\paragraph{LLM-as-judge reward model.}
A classifier $C$ takes as input the triple $(q_t, a_t, q_{t+1})$---the user
query, assistant response, and user follow-up---and classifies the follow-up
into one of seven mutually exclusive labels:
%
\begin{itemize}[nosep]
  \item \texttt{neg\_constraint\_restate}: user reasserts previously stated preferences ($r = {-}1.0$);
  \item \texttt{neg\_correction}: user indicates the response is incorrect ($r = {-}0.8$);
  \item \texttt{neg\_confusion}: user expresses confusion or requests re-explanation ($r = {-}0.6$);
  \item \texttt{pos\_praise}: explicit satisfaction or praise ($r = {+}0.8$);
  \item \texttt{pos\_progress}: constructive continuation such as follow-up questions or extensions ($r = {+}0.1$);
  \item \texttt{neutral}: ambiguous or minimal feedback ($r = 0$);
  \item \texttt{topic\_shift}: user switches to an unrelated topic (update skipped).
\end{itemize}
%
Every label except \texttt{topic\_shift} is mapped to a fixed scalar reward
$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$ via a predefined reward
map. The classifier also outputs a confidence score $c_t \in [0, 1]$.

\paragraph{Confidence gating.}
We gate the update using the classifier's own confidence. Define
\[
g_t =
\begin{cases}
c_t & \text{if } c_t \geq \tau_c \text{ and label} \neq \texttt{topic\_shift}, \\
0 & \text{otherwise},
\end{cases}
\]
where $\tau_c = 0.6$ is a confidence threshold. When $g_t = 0$, the reward is
zeroed out and no user-vector update is performed. This prevents noisy or
ambiguous signals from corrupting the learned user representation.

In our experiments we instantiate $C$ with a Llama-3.1-8B-Instruct model
served via vLLM, using structured JSON output. The judge receives only the
$(q_t, a_t, q_{t+1})$ triple, with no access to the user's preference profile
or retrieved memories, ensuring that the reward signal reflects observable
interaction quality rather than oracle information.
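% Illustrative worked example of the gating rule; the numbers below are
% hypothetical and not taken from the experiments.
For example, if the judge labels a follow-up \texttt{neg\_correction} with
confidence $c_t = 0.85$, then $g_t = 0.85 \geq \tau_c$ and the mapped reward
$\hat{r}_t = -0.8$ enters the update; if instead the same label comes with
$c_t = 0.5 < \tau_c$, then $g_t = 0$ and the turn contributes no update at
all.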
\paragraph{Running baseline.}
We maintain a running baseline $b_u$ for each user to approximate their
average reward:
\[
b_u \leftarrow (1 - \alpha) b_u + \alpha \hat{r}_t,
\]
with a small smoothing coefficient $\alpha$.

\paragraph{REINFORCE update.}
Let $A_t \subseteq M_t$ be the set of memories actually used at time $t$
(e.g., the top-$k$ items under $s(\cdot)$). We define the average item vector
of the chosen memories:
\[
v_{\text{chosen},t} = \frac{1}{|A_t|} \sum_{m_i \in A_t} v_{m_i},
\]
and the policy expectation in item space:
\[
\mu_t = \mathbb{E}_{m \sim \pi_z}[v_m] = \sum_{i=1}^{K} \pi_z(m_i \mid q_t, u)\, v_{m_i}.
\]
We then define a retrieval-specific advantage:
\[
A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
which down-weights updates when the classifier confidence is low (small
$g_t$) or when the reward is close to the user's baseline performance. The
user vectors are updated with a REINFORCE-style rule:
\[
\Delta z^{(L)}_u = \eta_L \frac{A^{\text{retr}}_t}{\tau}\, (v_{\text{chosen},t} - \mu_t),
\qquad
\Delta z^{(S)}_{u,t} = \eta_S \frac{A^{\text{retr}}_t}{\tau}\, (v_{\text{chosen},t} - \mu_t),
\]
with learning rates $\eta_L, \eta_S$. The short-term vector further undergoes
exponential decay:
\[
z^{(S)}_{u,t+1} = (1 - \lambda) z^{(S)}_{u,t} + \Delta z^{(S)}_{u,t},
\]
for some decay rate $\lambda \in (0, 1)$. A positive advantage
($A^{\text{retr}}_t > 0$) moves the user vectors toward the directions of the
actually used memories, while a negative advantage pushes them away. Because
the backbone LLM, embedding model, and reranker are frozen, the entire
personalization signal is captured by these low-dimensional user vectors and
the external preference memory. The scalar-reward interface ensures that, in
future applications, we can swap in richer feedback sources (explicit or
implicit) without redesigning the update rule.
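% Illustrative one-step calculation; all numeric values below are
% hypothetical and chosen only to make the magnitudes concrete.
As a concrete illustration, suppose $\hat{r}_t = +0.8$, $b_u = 0.1$, and
$g_t = 0.9$; then $A^{\text{retr}}_t = 0.9 \times (0.8 - 0.1) = 0.63$. With
$\eta_S = 0.1$ and $\tau = 1$, the short-term update is
$\Delta z^{(S)}_{u,t} = 0.063\, (v_{\text{chosen},t} - \mu_t)$, a small step
toward the memories that were actually used in this turn.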
Intuitively, users who repeatedly make similar retrieval decisions under
comparable reward profiles will accumulate updates in similar directions in
the item space, in the spirit of classic collaborative
filtering~\cite{hu2008implicitcf,koren2009matrix}, causing their long-term
vectors $z^{(L)}_u$ to cluster; Section~\ref{sec:results} analyzes this
emergent geometry and its relationship to ground-truth preference overlap.
```
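To make the update loop concrete, here is a minimal Python sketch of the turn-level procedure the subsection describes: the label-to-reward map, confidence gating, the running baseline, and the REINFORCE-style vector update with short-term decay. The reward map and $\tau_c = 0.6$ come from the text; all function names and the remaining hyperparameter defaults (`eta_l`, `eta_s`, `tau`, `decay`, `alpha`) are illustrative placeholders, not the actual implementation.

```python
import numpy as np

# Fixed label -> reward map from the section (topic_shift has no entry:
# the update is skipped entirely for that label).
REWARD_MAP = {
    "neg_constraint_restate": -1.0,
    "neg_correction": -0.8,
    "neg_confusion": -0.6,
    "pos_praise": 0.8,
    "pos_progress": 0.1,
    "neutral": 0.0,
}
TAU_C = 0.6  # confidence threshold from the text


def gate(label: str, confidence: float) -> float:
    """g_t = c_t if c_t >= tau_c and label != topic_shift, else 0."""
    if label == "topic_shift" or confidence < TAU_C:
        return 0.0
    return confidence


def reinforce_update(z_long, z_short, v_chosen, mu, r_hat, baseline, g,
                     eta_l=0.05, eta_s=0.1, tau=1.0, decay=0.1, alpha=0.1):
    """One-turn update of the long- and short-term user vectors.

    Hyperparameter defaults here are hypothetical; the section only
    fixes tau_c = 0.6 and the reward map.
    """
    advantage = g * (r_hat - baseline)   # A^retr_t = g_t (r_hat_t - b_u)
    direction = v_chosen - mu            # v_chosen,t - mu_t
    # Long-term vector: plain REINFORCE-style step.
    z_long = z_long + eta_l * (advantage / tau) * direction
    # Short-term vector: exponential decay plus the same-shaped step.
    z_short = (1.0 - decay) * z_short + eta_s * (advantage / tau) * direction
    # Running per-user baseline (EMA of observed rewards).
    baseline = (1.0 - alpha) * baseline + alpha * r_hat
    return z_long, z_short, baseline
```

A positive advantage moves both vectors toward `v_chosen - mu`, a negative one pushes them away, and a gated-out turn (`g == 0`) leaves the vectors unchanged while the baseline still tracks nothing, since no reward is applied.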