| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 03:14:37 +0000 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 03:14:37 +0000 |
| commit | 6a917d3eda85e5725c2d5ad3bf5ec9bd30262198 (patch) | |
| tree | 5c9408962f01036119ebe29cd34b45bf951865bd /docs/reward_section_revised.md | |
| parent | 1956aed8bc8a72355adbe9f1d16ea678d67f214c (diff) | |
Rewrite reward section to describe keyword heuristic (matches experiments)
Replaced LLM-as-judge description with actual keyword-based system:
- Reward: sentiment keyword matching + topic coherence via embedding similarity
- Gating: separate retrieval-attribution heuristic using memory-query cosine
similarity (g_t=0.9 retrieval fault, g_t=0.2 LLM fault, etc.)
- No additional model needed (fast, no GPU)
- REINFORCE update unchanged
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'docs/reward_section_revised.md')
| -rw-r--r-- | docs/reward_section_revised.md | 126 |
1 files changed, 64 insertions, 62 deletions
diff --git a/docs/reward_section_revised.md b/docs/reward_section_revised.md
index 3dd259a..86d5b08 100644
--- a/docs/reward_section_revised.md
+++ b/docs/reward_section_revised.md
@@ -1,14 +1,13 @@
 # Revised LaTeX: Reward Modeling and Online User Adaptation
 
-Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.
+Describes the keyword-based heuristic reward model (matching actual experiments).
 
-## Changes from original
+## Key design
 
-1. **Input**: Removed $A_t$ from reward model input — code only uses $(q_t, a_t, q_{t+1})$
-2. **Classifier**: Replaced abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with single 7-label LLM-as-judge classifier + fixed reward mapping
-3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
-4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
-5. **REINFORCE section**: Kept as-is (already correct)
+1. **Reward**: keyword matching on $q_{t+1}$ (negative/positive) + topic coherence via embedding cosine similarity
+2. **Gating**: separate heuristic based on memory-query similarity → attributes reward to retrieval vs generation
+3. **No additional model**: lightweight, runs without GPU
+4. **REINFORCE update**: unchanged
 
 ---
 
@@ -33,67 +32,70 @@ click-through, or task-completion signals).
 \textbf{In this report, we instantiate this general interface
 with a specific form of weak feedback available in our
 MultiSessionCollab case study.}
-Concretely, we use an automated LLM judge that observes the
-user's follow-up message $q_{t+1}$ to derive a scalar reward
-for the previous assistant response $a_t$.
+Concretely, we observe the user's follow-up message $q_{t+1}$
+to derive a scalar reward for the previous assistant response
+$a_t$.
 If the user continues the task smoothly without restating
 preferences, the response is treated as more rewarding; if
 they restate preferences or express dissatisfaction, the
 response is treated as less rewarding.
 
-\paragraph{LLM-as-judge reward model.}
-A classifier $C$ takes as input the triple
-$(q_t, a_t, q_{t+1})$---the user query, assistant response,
-and user follow-up---and classifies the follow-up into one of
-seven mutually exclusive labels:
+\paragraph{Keyword-based reward estimation.}
+The reward function $\hat{r}_t$ is computed from the user's
+follow-up $q_{t+1}$ using two lightweight signals:
 %
-\begin{itemize}[nosep]
-  \item \texttt{neg\_constraint\_restate}: user reasserts
-    previously stated preferences ($r = {-}1.0$);
-  \item \texttt{neg\_correction}: user indicates the response
-    is incorrect ($r = {-}0.8$);
-  \item \texttt{neg\_confusion}: user expresses confusion or
-    requests re-explanation ($r = {-}0.6$);
-  \item \texttt{pos\_praise}: explicit satisfaction or praise
-    ($r = {+}0.8$);
-  \item \texttt{pos\_progress}: constructive continuation such
-    as follow-up questions or extensions ($r = {+}0.1$);
-  \item \texttt{neutral}: ambiguous or minimal feedback
-    ($r = 0$);
-  \item \texttt{topic\_shift}: user switches to an unrelated
-    topic (update skipped).
-\end{itemize}
+\begin{enumerate}[nosep]
+  \item \emph{Sentiment keywords.}
+    We maintain curated lists of negative indicators
+    (e.g., ``incorrect'', ``redo'', ``not what I asked'')
+    and positive indicators
+    (e.g., ``thanks'', ``continue'', ``what if'').
+    A negative match contributes $-1.0$ to the reward;
+    a positive match contributes up to $+1.0$.
+  \item \emph{Topic coherence.}
+    The cosine similarity between the query embeddings
+    $e_{q_t}$ and $e_{q_{t+1}}$ detects topic shifts.
+    When topic similarity is low ($< 0.2$), the reward is
+    dampened, since a topic change makes it ambiguous whether
+    the previous response was satisfactory or the user gave up.
+\end{enumerate}
 %
-Each label is mapped to a fixed scalar reward
-$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$
-via a predefined reward map.
-The classifier also outputs a confidence score
-$c_t \in [0, 1]$.
+The final reward is clipped to $[-1, 1]$.
+This heuristic requires no additional model or API call,
+making it fast and suitable for large-scale evaluation.
 
-\paragraph{Confidence gating.}
-We gate the update using the classifier's own confidence.
-Define
-\[
-  g_t =
-  \begin{cases}
-    c_t & \text{if } c_t \geq \tau_c \text{ and label}
-      \neq \texttt{topic\_shift}, \\
-    0 & \text{otherwise},
-  \end{cases}
-\]
-where $\tau_c = 0.6$ is a confidence threshold.
-When $g_t = 0$ the reward is zeroed out and no user-vector
-update is performed.
-This prevents noisy or ambiguous signals from corrupting the
-learned user representation.
+\paragraph{Retrieval-attribution gating.}
+Not all feedback is informative for updating the retrieval
+policy: a negative reward caused by the LLM's generation
+quality (rather than poor memory retrieval) should not
+penalize the user vector.
+We therefore compute a gating scalar
+$g_t \in [0, 1]$ that estimates how much of the observed
+reward should be attributed to the retrieval component.
 
-In our experiments we instantiate $C$ with a
-Llama-3.1-8B-Instruct model served via vLLM, using
-structured JSON output.
-The judge receives only the $(q_t, a_t, q_{t+1})$ triple
-with no access to the user's preference profile or retrieved
-memories, ensuring that the reward signal reflects observable
-interaction quality rather than oracle information.
+Let $s^{\max}_q = \max_i \cos(e_{m_i}, e_{q_t})$ be the
+maximum cosine similarity between any retrieved memory
+embedding and the current query.
+The gating heuristic distinguishes three cases:
+%
+\begin{itemize}[nosep]
+  \item If the reward is strongly negative and no memory is
+    similar to either query ($s^{\max}_q < 0.2$),
+    retrieval is likely at fault (failed to surface
+    relevant preferences): $g_t = 0.9$.
+  \item If the reward is strongly negative but a relevant
+    memory was retrieved ($s^{\max}_q > 0.5$), the LLM
+    likely failed to follow the preference despite having
+    it in context: $g_t = 0.2$.
+  \item If the reward is positive and a relevant memory was
+    retrieved, retrieval contributed to the good outcome:
+    $g_t = 0.6$; otherwise $g_t = 0.3$.
+\end{itemize}
+%
+This gating mechanism ensures that the REINFORCE update
+primarily adjusts the user vector when the retrieval
+policy---rather than the frozen LLM---is plausibly
+responsible for the observed outcome.
 
 We maintain a running baseline $b_u$ for each user to
 approximate their average reward:
@@ -120,9 +122,9 @@ We then define a retrieval-specific advantage:
 \[
   A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
 \]
-which down-weights updates when the classifier confidence is
-low (small $g_t$) or when the reward is close to the user's
-baseline performance.
+which down-weights updates when retrieval is unlikely
+responsible for the outcome (small $g_t$) or when the reward
+is close to the user's baseline performance.
 The user vectors are updated with a REINFORCE-style rule:
 \[
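The reward, gating, and update heuristics described in the diff can be sketched in Python. This is a minimal illustrative reimplementation, not the repository's code: the keyword lists, the topic-shift dampening factor, the EMA form of the running baseline $b_u$, and the fallback gate value for mild/neutral feedback are all assumptions; the similarity thresholds ($< 0.2$, $> 0.5$) and gate values ($0.9$, $0.2$, $0.6$, $0.3$) come from the diff above.

```python
import numpy as np

# Illustrative keyword lists; the actual curated lists are not shown in the diff.
NEG_KEYWORDS = ["incorrect", "redo", "not what i asked"]
POS_KEYWORDS = ["thanks", "continue", "what if"]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, returning 0.0 for zero-norm inputs."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def keyword_reward(followup: str, e_qt: np.ndarray, e_qt1: np.ndarray) -> float:
    """Sentiment keyword matching plus topic-coherence dampening, clipped to [-1, 1]."""
    text = followup.lower()
    r = 0.0
    if any(k in text for k in NEG_KEYWORDS):
        r -= 1.0
    if any(k in text for k in POS_KEYWORDS):
        r += 1.0
    if cosine(e_qt, e_qt1) < 0.2:  # low topic coherence -> ambiguous feedback
        r *= 0.5                   # dampening factor is an assumption
    return float(np.clip(r, -1.0, 1.0))

def retrieval_gate(r: float, s_max_q: float) -> float:
    """Attribute reward to retrieval (high g_t) vs. generation (low g_t)."""
    if r < -0.5 and s_max_q < 0.2:
        return 0.9  # no memory resembled the query: retrieval likely at fault
    if r < -0.5 and s_max_q > 0.5:
        return 0.2  # relevant memory was in context: LLM likely at fault
    if r > 0.0:
        return 0.6 if s_max_q > 0.5 else 0.3
    return 0.3      # fallback for mild/neutral feedback (assumption)

def reinforce_step(u, grad_log_pi, r, g, b_u, lr=0.1, beta=0.9):
    """One update: A_retr = g * (r - b_u); u <- u + lr * A_retr * grad log pi."""
    advantage = g * (r - b_u)
    b_u_new = beta * b_u + (1.0 - beta) * r  # running per-user baseline (EMA; assumption)
    return u + lr * advantage * grad_log_pi, b_u_new
```

Note how the gate, not the raw reward, decides whether the user vector moves: a strongly negative reward with a well-matched retrieved memory ($s^{\max}_q > 0.5$) yields $g_t = 0.2$, so the update is largely suppressed even though the feedback was negative.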
