author     YurenHao0426 <blackhao0426@gmail.com>  2026-02-11 03:14:37 +0000
committer  YurenHao0426 <blackhao0426@gmail.com>  2026-02-11 03:14:37 +0000
commit     6a917d3eda85e5725c2d5ad3bf5ec9bd30262198 (patch)
tree       5c9408962f01036119ebe29cd34b45bf951865bd /docs/reward_section_revised.md
parent     1956aed8bc8a72355adbe9f1d16ea678d67f214c (diff)
Rewrite reward section to describe keyword heuristic (matches experiments)
Replaced LLM-as-judge description with actual keyword-based system:

- Reward: sentiment keyword matching + topic coherence via embedding similarity
- Gating: separate retrieval-attribution heuristic using memory-query cosine similarity (g_t=0.9 retrieval fault, g_t=0.2 LLM fault, etc.)
- No additional model needed (fast, no GPU)
- REINFORCE update unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'docs/reward_section_revised.md')
-rw-r--r--  docs/reward_section_revised.md | 126
1 file changed, 64 insertions(+), 62 deletions(-)
diff --git a/docs/reward_section_revised.md b/docs/reward_section_revised.md
index 3dd259a..86d5b08 100644
--- a/docs/reward_section_revised.md
+++ b/docs/reward_section_revised.md
@@ -1,14 +1,13 @@
# Revised LaTeX: Reward Modeling and Online User Adaptation
-Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.
+Describes the keyword-based heuristic reward model (matching the actual experiments).
-## Changes from original
+## Key design
-1. **Input**: Removed $A_t$ from reward model input — code only uses $(q_t, a_t, q_{t+1})$
-2. **Classifier**: Replaced abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with single 7-label LLM-as-judge classifier + fixed reward mapping
-3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
-4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
-5. **REINFORCE section**: Kept as-is (already correct)
+1. **Reward**: keyword matching on $q_{t+1}$ (negative/positive) + topic coherence via embedding cosine similarity
+2. **Gating**: separate heuristic based on memory-query similarity → attributes reward to retrieval vs generation
+3. **No additional model**: lightweight, runs without GPU
+4. **REINFORCE update**: unchanged
---
@@ -33,67 +32,70 @@ click-through, or task-completion signals).
\textbf{In this report, we instantiate this general interface
with a specific form of weak feedback available in our
MultiSessionCollab case study.}
-Concretely, we use an automated LLM judge that observes the
-user's follow-up message $q_{t+1}$ to derive a scalar reward
-for the previous assistant response $a_t$.
+Concretely, we observe the user's follow-up message $q_{t+1}$
+to derive a scalar reward for the previous assistant response
+$a_t$.
If the user continues the task smoothly without restating
preferences, the response is treated as more rewarding; if
they restate preferences or express dissatisfaction, the
response is treated as less rewarding.
-\paragraph{LLM-as-judge reward model.}
-A classifier $C$ takes as input the triple
-$(q_t, a_t, q_{t+1})$---the user query, assistant response,
-and user follow-up---and classifies the follow-up into one of
-seven mutually exclusive labels:
+\paragraph{Keyword-based reward estimation.}
+The reward function $\hat{r}_t$ is computed from the user's
+follow-up $q_{t+1}$ using two lightweight signals:
%
-\begin{itemize}[nosep]
- \item \texttt{neg\_constraint\_restate}: user reasserts
- previously stated preferences ($r = {-}1.0$);
- \item \texttt{neg\_correction}: user indicates the response
- is incorrect ($r = {-}0.8$);
- \item \texttt{neg\_confusion}: user expresses confusion or
- requests re-explanation ($r = {-}0.6$);
- \item \texttt{pos\_praise}: explicit satisfaction or praise
- ($r = {+}0.8$);
- \item \texttt{pos\_progress}: constructive continuation such
- as follow-up questions or extensions ($r = {+}0.1$);
- \item \texttt{neutral}: ambiguous or minimal feedback
- ($r = 0$);
- \item \texttt{topic\_shift}: user switches to an unrelated
- topic (update skipped).
-\end{itemize}
+\begin{enumerate}[nosep]
+ \item \emph{Sentiment keywords.}
+ We maintain curated lists of negative indicators
+ (e.g., ``incorrect'', ``redo'', ``not what I asked'')
+ and positive indicators
+ (e.g., ``thanks'', ``continue'', ``what if'').
+ A negative match contributes $-1.0$ to the reward;
+ a positive match contributes up to $+1.0$.
+ \item \emph{Topic coherence.}
+ The cosine similarity between the query embeddings
+ $e_{q_t}$ and $e_{q_{t+1}}$ detects topic shifts.
+ When topic similarity is low ($< 0.2$), the reward is
+ dampened, since a topic change makes it ambiguous whether
+ the previous response was satisfactory or the user gave up.
+\end{enumerate}
%
-Each label is mapped to a fixed scalar reward
-$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$
-via a predefined reward map.
-The classifier also outputs a confidence score
-$c_t \in [0, 1]$.
+The final reward is clipped to $[-1, 1]$.
+This heuristic requires no additional model or API call,
+making it fast and suitable for large-scale evaluation.
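As an illustration, the two signals above can be sketched roughly as follows. All identifiers, the keyword lists, and the damping factor are assumptions for exposition, not the repository's actual code:

```python
# Sketch of the keyword-based reward heuristic: sentiment keyword
# matching on the follow-up q_{t+1}, dampened when the query
# embeddings indicate a topic shift.  Illustrative only.
import numpy as np

NEG_KEYWORDS = ["incorrect", "redo", "not what i asked"]  # example indicators
POS_KEYWORDS = ["thanks", "continue", "what if"]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_reward(q_next: str, e_q: np.ndarray, e_q_next: np.ndarray) -> float:
    """Heuristic reward r_hat in [-1, 1] from the user's follow-up."""
    text = q_next.lower()
    r = 0.0
    if any(k in text for k in NEG_KEYWORDS):
        r -= 1.0                          # negative match contributes -1.0
    if any(k in text for k in POS_KEYWORDS):
        r += 1.0                          # positive match contributes up to +1.0
    if cosine(e_q, e_q_next) < 0.2:       # low topic coherence: signal is ambiguous
        r *= 0.5                          # damping factor is an assumption
    return max(-1.0, min(1.0, r))         # clip to [-1, 1]
```

A follow-up like "this is incorrect" on-topic yields $-1.0$; the same follow-up after a topic shift is dampened toward zero.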
-\paragraph{Confidence gating.}
-We gate the update using the classifier's own confidence.
-Define
-\[
- g_t =
- \begin{cases}
- c_t & \text{if } c_t \geq \tau_c \text{ and label}
- \neq \texttt{topic\_shift}, \\
- 0 & \text{otherwise},
- \end{cases}
-\]
-where $\tau_c = 0.6$ is a confidence threshold.
-When $g_t = 0$ the reward is zeroed out and no user-vector
-update is performed.
-This prevents noisy or ambiguous signals from corrupting the
-learned user representation.
+\paragraph{Retrieval-attribution gating.}
+Not all feedback is informative for updating the retrieval
+policy: a negative reward caused by the LLM's generation
+quality (rather than poor memory retrieval) should not
+penalize the user vector.
+We therefore compute a gating scalar
+$g_t \in [0, 1]$ that estimates how much of the observed
+reward should be attributed to the retrieval component.
-In our experiments we instantiate $C$ with a
-Llama-3.1-8B-Instruct model served via vLLM, using
-structured JSON output.
-The judge receives only the $(q_t, a_t, q_{t+1})$ triple
-with no access to the user's preference profile or retrieved
-memories, ensuring that the reward signal reflects observable
-interaction quality rather than oracle information.
+Let $s^{\max}_q = \max_i \cos(e_{m_i}, e_{q_t})$ be the
+maximum cosine similarity between any retrieved memory
+embedding and the current query.
+The gating heuristic distinguishes three cases:
+%
+\begin{itemize}[nosep]
+  \item If the reward is strongly negative and no retrieved
+    memory is similar to the current query
+    ($s^{\max}_q < 0.2$), retrieval is likely at fault
+    (it failed to surface relevant preferences): $g_t = 0.9$.
+ \item If the reward is strongly negative but a relevant
+ memory was retrieved ($s^{\max}_q > 0.5$), the LLM
+ likely failed to follow the preference despite having
+ it in context: $g_t = 0.2$.
+ \item If the reward is positive and a relevant memory was
+ retrieved, retrieval contributed to the good outcome:
+ $g_t = 0.6$; otherwise $g_t = 0.3$.
+\end{itemize}
+%
+This gating mechanism ensures that the REINFORCE update
+primarily adjusts the user vector when the retrieval
+policy---rather than the frozen LLM---is plausibly
+responsible for the observed outcome.
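The three cases can be sketched as follows. The "strongly negative" threshold and the fallback value for the middle similarity band ($0.2 \le s^{\max}_q \le 0.5$) are assumptions the text leaves open:

```python
# Sketch of the retrieval-attribution gating heuristic.  Case values
# (0.9 / 0.2 / 0.6 / 0.3) follow the text; thresholds for "strongly
# negative" and the middle-band fallback are illustrative assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gating(reward: float, memory_embs, e_q: np.ndarray,
           neg_thresh: float = -0.5) -> float:
    """Return g_t in [0, 1]: share of reward attributed to retrieval."""
    # s_max_q: max similarity between any retrieved memory and the query
    s_max = max((cosine(m, e_q) for m in memory_embs), default=0.0)
    if reward < neg_thresh and s_max < 0.2:
        return 0.9   # nothing relevant retrieved: retrieval likely at fault
    if reward < neg_thresh and s_max > 0.5:
        return 0.2   # relevant memory was in context: LLM likely at fault
    if reward > 0 and s_max > 0.5:
        return 0.6   # retrieval contributed to the good outcome
    return 0.3       # otherwise (incl. ambiguous middle band)
```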
We maintain a running baseline $b_u$ for each user to approximate
their average reward:
@@ -120,9 +122,9 @@ We then define a retrieval-specific advantage:
\[
A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
-which down-weights updates when the classifier confidence is
-low (small $g_t$) or when the reward is close to the user's
-baseline performance.
+which down-weights updates when retrieval is unlikely to be
+responsible for the outcome (small $g_t$) or when the reward
+is close to the user's baseline performance.
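The running baseline and the retrieval-specific advantage can be sketched as below; the running-average coefficient and the order of the baseline update relative to the advantage computation are assumptions:

```python
# Sketch of the per-user running baseline b_u and the
# retrieval-specific advantage A_retr = g_t * (r_hat - b_u).
# beta (running-average coefficient) is an illustrative assumption.

def retrieval_advantage(r_hat: float, g_t: float, b_u: float,
                        beta: float = 0.9):
    """Return (A_retr_t, updated b_u) for one feedback step."""
    adv = g_t * (r_hat - b_u)              # gated, baseline-centered advantage
    b_u = beta * b_u + (1 - beta) * r_hat  # running average of the user's rewards
    return adv, b_u
```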
The user vectors are updated with a REINFORCE-style rule:
\[