# Revised LaTeX: Reward Modeling and Online User Adaptation

Describes the keyword-based heuristic reward model (matching the actual experiments).

## Key design

1. **Reward**: keyword matching on $q_{t+1}$ (negative/positive) plus topic coherence via embedding cosine similarity
2. **Gating**: a separate heuristic based on memory-query similarity attributes reward to retrieval vs. generation
3. **No additional model**: lightweight, runs without a GPU
4. **REINFORCE update**: unchanged

---

## Revised LaTeX

```latex
\subsection{Reward Modeling and Online User Adaptation}
\label{sec:method-rl}

We do not assume a specific type of feedback such as thumbs-up/down. Instead, we treat all available interaction signals as being mapped into a \emph{scalar reward} that measures how helpful the system was for the user at a given turn~\cite{williams1992simple}. In our framework, this scalar reward is the only interface between feedback and the user-vector update: different applications can plug in different feedback sources as long as they can be converted into a real-valued reward (e.g., explicit ratings, binary ``good/bad'' labels, dwell time, click-through, or task-completion signals).

\textbf{In this report, we instantiate this general interface with a specific form of weak feedback available in our MultiSessionCollab case study.} Concretely, we observe the user's follow-up message $q_{t+1}$ to derive a scalar reward for the previous assistant response $a_t$. If the user continues the task smoothly without restating preferences, the response is treated as more rewarding; if they restate preferences or express dissatisfaction, it is treated as less rewarding.
\paragraph{Keyword-based reward estimation.}
The reward $\hat{r}_t$ is computed from the user's follow-up $q_{t+1}$ using two lightweight signals:
%
\begin{enumerate}[nosep]
  \item \emph{Sentiment keywords.} We maintain curated lists of negative indicators (e.g., ``incorrect'', ``redo'', ``not what I asked'') and positive indicators (e.g., ``thanks'', ``continue'', ``what if''). A negative match contributes $-1.0$ to the reward; a positive match contributes up to $+1.0$.
  \item \emph{Topic coherence.} The cosine similarity between the query embeddings $e_{q_t}$ and $e_{q_{t+1}}$ detects topic shifts. When topic similarity is low ($< 0.2$), the reward is dampened, since a topic change makes it ambiguous whether the previous response was satisfactory or the user simply gave up.
\end{enumerate}
%
The final reward is clipped to $[-1, 1]$. This heuristic requires no additional model or API call, making it fast and suitable for large-scale evaluation.

\paragraph{Retrieval-attribution gating.}
Not all feedback is informative for updating the retrieval policy: a negative reward caused by the LLM's generation quality (rather than poor memory retrieval) should not penalize the user vector. We therefore compute a gating scalar $g_t \in [0, 1]$ that estimates how much of the observed reward should be attributed to the retrieval component. Let $s^{\max}_q = \max_i \cos(e_{m_i}, e_{q_t})$ denote the maximum cosine similarity between any retrieved memory embedding and the current query. The gating heuristic distinguishes three cases:
%
\begin{itemize}[nosep]
  \item If the reward is strongly negative and no memory is similar to the current query ($s^{\max}_q < 0.2$), retrieval is likely at fault (it failed to surface relevant preferences): $g_t = 0.9$.
  \item If the reward is strongly negative but a relevant memory was retrieved ($s^{\max}_q > 0.5$), the LLM likely failed to follow the preference despite having it in context: $g_t = 0.2$.
  \item If the reward is positive and a relevant memory was retrieved, retrieval contributed to the good outcome: $g_t = 0.6$; otherwise $g_t = 0.3$.
\end{itemize}
%
This gating mechanism ensures that the REINFORCE update primarily adjusts the user vector when the retrieval policy---rather than the frozen LLM---is plausibly responsible for the observed outcome. We maintain a running baseline $b_u$ for each user to approximate their average reward:
\[
  b_u \leftarrow (1 - \alpha)\, b_u + \alpha\, \hat{r}_t,
\]
with a small smoothing coefficient $\alpha$.

\paragraph{REINFORCE update.}
Let $A_t \subseteq M_t$ be the set of memories actually used at time $t$ (e.g., the top-$k$ items under $s(\cdot)$). We define the average item vector of the chosen memories,
\[
  v_{\text{chosen},t} = \frac{1}{|A_t|} \sum_{m_i \in A_t} v_{m_i},
\]
and the policy expectation in item space,
\[
  \mu_t = \mathbb{E}_{m \sim \pi_z}[v_m] = \sum_{i=1}^{K} \pi_z(m_i \mid q_t, u)\, v_{m_i}.
\]
We then define a retrieval-specific advantage,
\[
  A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
which down-weights updates when retrieval is unlikely to be responsible for the outcome (small $g_t$) or when the reward is close to the user's baseline performance. The user vectors are updated with a REINFORCE-style rule:
\[
  \Delta z^{(L)}_u = \eta_L \frac{A^{\text{retr}}_t}{\tau}\, \bigl(v_{\text{chosen},t} - \mu_t\bigr),
\qquad
  \Delta z^{(S)}_{u,t} = \eta_S \frac{A^{\text{retr}}_t}{\tau}\, \bigl(v_{\text{chosen},t} - \mu_t\bigr),
\]
with learning rates $\eta_L, \eta_S$. The short-term vector further undergoes exponential decay:
\[
  z^{(S)}_{u,t+1} = (1 - \lambda)\, z^{(S)}_{u,t} + \Delta z^{(S)}_{u,t},
\]
for some decay rate $\lambda \in (0, 1)$. A positive advantage ($A^{\text{retr}}_t > 0$) moves the user vectors toward the directions of the actually used memories, while a negative advantage pushes them away.
Because the backbone LLM, embedding model, and reranker are frozen, the entire personalization signal is captured by these low-dimensional user vectors and the external preference memory. The scalar-reward interface ensures that, in future applications, we can swap in richer feedback sources (explicit or implicit) without redesigning the update rule. Intuitively, users who repeatedly make similar retrieval decisions under similar reward profiles accumulate updates in similar directions in item space, similar in spirit to classic collaborative filtering~\cite{hu2008implicitcf,koren2009matrix}, causing their long-term vectors $z^{(L)}_u$ to cluster; Section~\ref{sec:results} analyzes this emergent geometry and its relationship to ground-truth preference overlap.
```
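As a concrete illustration, the keyword-plus-coherence reward heuristic can be sketched in a few lines of Python. The keyword lists, the 0.5 damping factor, and the names `estimate_reward`/`cosine` are illustrative assumptions; the 0.2 topic-shift threshold and the $[-1, 1]$ clipping follow the text.

```python
import math

# Illustrative keyword lists; the experiments use larger curated lists.
NEGATIVE_KEYWORDS = ["incorrect", "redo", "not what i asked"]
POSITIVE_KEYWORDS = ["thanks", "continue", "what if"]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def estimate_reward(q_next, e_q, e_q_next, topic_threshold=0.2, damp=0.5):
    """Scalar reward for the previous response, derived from the follow-up."""
    text = q_next.lower()
    r = 0.0
    if any(k in text for k in NEGATIVE_KEYWORDS):
        r -= 1.0  # negative indicator matched
    if any(k in text for k in POSITIVE_KEYWORDS):
        r += 1.0  # positive indicator matched
    if cosine(e_q, e_q_next) < topic_threshold:
        r *= damp  # topic shift: satisfaction is ambiguous, dampen the signal
    return max(-1.0, min(1.0, r))  # clip to [-1, 1]
```

Because the heuristic is pure string matching plus one dot product, it adds negligible latency per turn, which is what makes large-scale evaluation without a GPU feasible.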
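The three-case gating heuristic can be sketched the same way. The values $0.9/0.2/0.6/0.3$ and the $0.2/0.5$ similarity thresholds come from the text; the $0.5$ fallback for the ambiguous middle band and the name `retrieval_gate` are assumptions.

```python
def retrieval_gate(reward, s_max_q, low=0.2, high=0.5):
    """Gating scalar g_t in [0, 1]: how much of the observed reward to
    attribute to retrieval rather than to the frozen LLM's generation."""
    if reward < 0:
        if s_max_q < low:
            return 0.9  # nothing relevant retrieved: retrieval likely at fault
        if s_max_q > high:
            return 0.2  # relevant memory was in context: LLM likely at fault
        return 0.5      # ambiguous middle band (assumed default, not in text)
    return 0.6 if s_max_q > high else 0.3  # positive outcome
```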
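Finally, the gated REINFORCE step (baseline update, advantage, and the long-/short-term vector updates with decay) can be sketched as below. The update equations follow the text; the function name `reinforce_step` and all hyperparameter defaults ($\eta_L, \eta_S, \tau, \lambda, \alpha$) are placeholder assumptions.

```python
import numpy as np

def reinforce_step(z_long, z_short, b_u, r_hat, g_t, V, probs, chosen,
                   eta_L=0.01, eta_S=0.1, tau=1.0, lam=0.05, alpha=0.1):
    """One gated REINFORCE update of the per-user vectors.

    V:      (K, d) matrix of candidate memory embeddings v_{m_i}
    probs:  (K,) policy probabilities pi_z(m_i | q_t, u)
    chosen: indices of the memories actually used (the set A_t)
    """
    v_chosen = V[chosen].mean(axis=0)               # average chosen item vector
    mu = probs @ V                                  # policy expectation in item space
    advantage = g_t * (r_hat - b_u)                 # retrieval-specific advantage
    step = (advantage / tau) * (v_chosen - mu)
    z_long = z_long + eta_L * step                  # long-term vector update
    z_short = (1.0 - lam) * z_short + eta_S * step  # short-term update with decay
    b_u = (1.0 - alpha) * b_u + alpha * r_hat       # running per-user baseline
    return z_long, z_short, b_u
```

Note how a small gate $g_t$ or a reward near the baseline $b_u$ shrinks the advantage and hence the step, so the user vector barely moves when retrieval is unlikely to be responsible for the outcome.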