| field | value | date |
|---|---|---|
| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 02:29:27 +0000 |
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 02:29:27 +0000 |
| commit | f23b25dda044046ef6d21ed9c2e28df6f54e04d6 (patch) | |
| tree | e065e31ae42d0fcfcd66c7628adffdf0391df805 /docs/reward_section_revised.md | |
| parent | 8af96d046e69fe9463ce89f000f06916cc043b31 (diff) | |
Add revised reward modeling LaTeX section matching code implementation
Key changes from original:
- Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
- Single 7-label LLM classifier replaces abstract C_reward/C_gate
- Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
- Explicitly describes Llama-3.1-8B-Instruct as judge model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'docs/reward_section_revised.md')

| mode | file | lines |
|---|---|---|
| -rw-r--r-- | docs/reward_section_revised.md | 162 |

1 file changed, 162 insertions, 0 deletions
diff --git a/docs/reward_section_revised.md b/docs/reward_section_revised.md
new file mode 100644
index 0000000..3dd259a
--- /dev/null
+++ b/docs/reward_section_revised.md
@@ -0,0 +1,162 @@

# Revised LaTeX: Reward Modeling and Online User Adaptation

Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.

## Changes from original

1. **Input**: Removed $A_t$ from reward model input — code only uses $(q_t, a_t, q_{t+1})$
2. **Classifier**: Replaced abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with single 7-label LLM-as-judge classifier + fixed reward mapping
3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
5. **REINFORCE section**: Kept as-is (already correct)

---

## Revised LaTeX

```latex
\subsection{Reward Modeling and Online User Adaptation}
\label{sec:method-rl}

We do not assume a specific type of feedback such as thumbs-up/down.
Instead, we treat all available interaction signals as being mapped into a
\emph{scalar reward} that measures how helpful the system was for the user
at a given turn~\cite{williams1992simple}.
In our framework, this scalar reward is the only interface between feedback
and the user-vector update: different applications can plug in different
feedback sources as long as they can be converted into a real-valued reward
(e.g., explicit ratings, binary ``good/bad'' labels, dwell time,
click-through, or task-completion signals).

\textbf{In this report, we instantiate this general interface with a
specific form of weak feedback available in our MultiSessionCollab case
study.}
Concretely, we use an automated LLM judge that observes the user's
follow-up message $q_{t+1}$ to derive a scalar reward for the previous
assistant response $a_t$.
If the user continues the task smoothly without restating preferences, the
response is treated as more rewarding; if they restate preferences or
express dissatisfaction, the response is treated as less rewarding.

\paragraph{LLM-as-judge reward model.}
A classifier $C$ takes as input the triple $(q_t, a_t, q_{t+1})$---the user
query, assistant response, and user follow-up---and classifies the
follow-up into one of seven mutually exclusive labels:
%
\begin{itemize}[nosep]
  \item \texttt{neg\_constraint\_restate}: user reasserts previously
    stated preferences ($r = {-}1.0$);
  \item \texttt{neg\_correction}: user indicates the response is incorrect
    ($r = {-}0.8$);
  \item \texttt{neg\_confusion}: user expresses confusion or requests
    re-explanation ($r = {-}0.6$);
  \item \texttt{pos\_praise}: explicit satisfaction or praise
    ($r = {+}0.8$);
  \item \texttt{pos\_progress}: constructive continuation such as
    follow-up questions or extensions ($r = {+}0.1$);
  \item \texttt{neutral}: ambiguous or minimal feedback ($r = 0$);
  \item \texttt{topic\_shift}: user switches to an unrelated topic
    (update skipped).
\end{itemize}
%
Each label is mapped to a fixed scalar reward
$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$
via a predefined reward map.
The classifier also outputs a confidence score $c_t \in [0, 1]$.

\paragraph{Confidence gating.}
We gate the update using the classifier's own confidence.
Define
\[
  g_t =
  \begin{cases}
    c_t & \text{if } c_t \geq \tau_c \text{ and label}
          \neq \texttt{topic\_shift}, \\
    0   & \text{otherwise},
  \end{cases}
\]
where $\tau_c = 0.6$ is a confidence threshold.
When $g_t = 0$ the reward is zeroed out and no user-vector update is
performed.
This prevents noisy or ambiguous signals from corrupting the learned user
representation.

In our experiments we instantiate $C$ with a Llama-3.1-8B-Instruct model
served via vLLM, using structured JSON output.
The judge receives only the $(q_t, a_t, q_{t+1})$ triple, with no access
to the user's preference profile or retrieved memories, ensuring that the
reward signal reflects observable interaction quality rather than oracle
information.

We maintain a running baseline $b_u$ for each user to approximate their
average reward:
\[
  b_u \leftarrow (1 - \alpha)\, b_u + \alpha\, \hat{r}_t,
\]
with a small smoothing coefficient $\alpha$.

\paragraph{REINFORCE update.}
Let $A_t \subseteq M_t$ be the set of memories actually used at time $t$
(e.g., the top-$k$ items under $s(\cdot)$).
We define the average item vector of the chosen memories,
\[
  v_{\text{chosen},t} = \frac{1}{|A_t|} \sum_{m_i \in A_t} v_{m_i},
\]
and the policy expectation in item space,
\[
  \mu_t = \mathbb{E}_{m \sim \pi_z}[v_m]
        = \sum_{i=1}^{K} \pi_z(m_i \mid q_t, u)\, v_{m_i}.
\]
We then define a retrieval-specific advantage,
\[
  A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
which down-weights updates when the classifier confidence is low (small
$g_t$) or when the reward is close to the user's baseline performance.

The user vectors are updated with a REINFORCE-style rule:
\[
  \Delta z^{(L)}_u
  = \eta_L \frac{A^{\text{retr}}_t}{\tau}\,
    (v_{\text{chosen},t} - \mu_t),
\]
\[
  \Delta z^{(S)}_{u,t}
  = \eta_S \frac{A^{\text{retr}}_t}{\tau}\,
    (v_{\text{chosen},t} - \mu_t),
\]
with learning rates $\eta_L, \eta_S$.
The short-term vector further undergoes exponential decay:
\[
  z^{(S)}_{u,t+1} = (1 - \lambda)\, z^{(S)}_{u,t} + \Delta z^{(S)}_{u,t},
\]
for some decay rate $\lambda \in (0, 1)$.
Positive advantage ($A^{\text{retr}}_t > 0$) moves the user vectors toward
the directions of the actually used memories, while negative advantage
pushes them away.
Because the backbone LLM, embedding model, and reranker are frozen, the
entire personalization signal is captured by these low-dimensional user
vectors and the external preference memory.
The scalar-reward interface ensures that, in future applications, we can
swap in richer feedback sources (explicit or implicit) without redesigning
the update rule.

Intuitively, users who repeatedly make similar retrieval decisions under
similar reward profiles will accumulate updates in similar directions in
the item space, similar in spirit to classic collaborative
filtering~\cite{hu2008implicitcf,koren2009matrix}, causing their long-term
vectors $z^{(L)}_u$ to cluster; Section~\ref{sec:results} analyzes this
emergent geometry and its relationship to ground-truth preference overlap.
```
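For concreteness, the label-to-reward map and confidence gate described in the LaTeX section can be sketched in Python. This is an illustrative sketch only, not the project's actual code: the names `REWARD_MAP`, `TAU_C`, and `gated_reward` are hypothetical, and only the label strings, reward values, and threshold $\tau_c = 0.6$ come from the text above.

```python
# Hypothetical sketch of the 7-label reward map and confidence gating.
# Label strings, reward values, and the 0.6 threshold follow the text;
# all function and variable names are illustrative.

REWARD_MAP = {
    "neg_constraint_restate": -1.0,
    "neg_correction": -0.8,
    "neg_confusion": -0.6,
    "pos_praise": 0.8,
    "pos_progress": 0.1,
    "neutral": 0.0,
    # "topic_shift" has no entry: the update is skipped entirely.
}

TAU_C = 0.6  # confidence threshold tau_c


def gate(label: str, confidence: float) -> float:
    """Return g_t: the classifier confidence when usable, else 0."""
    if label == "topic_shift" or confidence < TAU_C:
        return 0.0
    return confidence


def gated_reward(label: str, confidence: float) -> tuple[float, float]:
    """Map a judge label to (reward r_hat_t, gate g_t).

    When the gate is closed the reward is zeroed out, matching the
    "no user-vector update is performed" rule in the text.
    """
    g = gate(label, confidence)
    r = REWARD_MAP.get(label, 0.0) if g > 0.0 else 0.0
    return r, g
```

A confident correction thus yields a strong negative reward, while the same label below threshold contributes nothing.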

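Putting the pieces together, one online step (the running baseline, the gated advantage, and the REINFORCE-style vector updates from the equations above) might look like the following NumPy sketch. The function name, hyperparameter defaults, and the choice to skip the baseline update on gated-out turns are assumptions, not taken from the codebase.

```python
import numpy as np

# Illustrative sketch of one online update step; hyperparameter values
# (tau, eta_long, eta_short, alpha, lam) are assumptions.


def reinforce_step(z_long, z_short, b_u, r_hat, g_t, v_chosen, mu,
                   tau=1.0, eta_long=0.01, eta_short=0.1,
                   alpha=0.05, lam=0.1):
    """One online update; returns (z_long, z_short, b_u)."""
    if g_t == 0.0:
        # Gated out (low confidence or topic_shift): no update performed.
        return z_long, z_short, b_u
    # Retrieval-specific advantage A_retr = g_t * (r_hat - b_u).
    advantage = g_t * (r_hat - b_u)
    direction = (v_chosen - mu) / tau
    # Long-term vector: plain REINFORCE-style step toward used memories.
    z_long = z_long + eta_long * advantage * direction
    # Short-term vector: exponential decay plus the same-direction step.
    z_short = (1.0 - lam) * z_short + eta_short * advantage * direction
    # Running baseline: b_u <- (1 - alpha) * b_u + alpha * r_hat.
    b_u = (1.0 - alpha) * b_u + alpha * r_hat
    return z_long, z_short, b_u
```

With a positive advantage the step moves both vectors toward `v_chosen - mu`, the direction of the actually used memories, as described in the text.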