author     YurenHao0426 <blackhao0426@gmail.com>  2026-02-11 03:14:37 +0000
committer  YurenHao0426 <blackhao0426@gmail.com>  2026-02-11 03:14:37 +0000
commit     6a917d3eda85e5725c2d5ad3bf5ec9bd30262198 (patch)
tree       5c9408962f01036119ebe29cd34b45bf951865bd /docs/reward_section_revised.md
parent     1956aed8bc8a72355adbe9f1d16ea678d67f214c (diff)
Rewrite reward section to describe keyword heuristic (matches experiments)
Replaced LLM-as-judge description with actual keyword-based system:

- Reward: sentiment keyword matching + topic coherence via embedding similarity
- Gating: separate retrieval-attribution heuristic using memory-query cosine similarity (g_t=0.9 retrieval fault, g_t=0.2 LLM fault, etc.)
- No additional model needed (fast, no GPU)
- REINFORCE update unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'docs/reward_section_revised.md')
-rw-r--r--  docs/reward_section_revised.md | 126
1 file changed, 64 insertions(+), 62 deletions(-)
diff --git a/docs/reward_section_revised.md b/docs/reward_section_revised.md
index 3dd259a..86d5b08 100644
--- a/docs/reward_section_revised.md
+++ b/docs/reward_section_revised.md
@@ -1,14 +1,13 @@
# Revised LaTeX: Reward Modeling and Online User Adaptation
-Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.
+Describes the keyword-based heuristic reward model (matching the actual experiments).
-## Changes from original
+## Key design
-1. **Input**: Removed $A_t$ from reward model input — code only uses $(q_t, a_t, q_{t+1})$
-2. **Classifier**: Replaced abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with single 7-label LLM-as-judge classifier + fixed reward mapping
-3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
-4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
-5. **REINFORCE section**: Kept as-is (already correct)
+1. **Reward**: keyword matching on $q_{t+1}$ (negative/positive) + topic coherence via embedding cosine similarity
+2. **Gating**: separate heuristic based on memory-query similarity → attributes reward to retrieval vs generation
+3. **No additional model**: lightweight, runs without GPU
+4. **REINFORCE update**: unchanged
---
@@ -33,67 +32,70 @@ click-through, or task-completion signals).
\textbf{In this report, we instantiate this general interface
with a specific form of weak feedback available in our
MultiSessionCollab case study.}
-Concretely, we use an automated LLM judge that observes the
-user's follow-up message $q_{t+1}$ to derive a scalar reward
-for the previous assistant response $a_t$.
+Concretely, we observe the user's follow-up message $q_{t+1}$
+to derive a scalar reward for the previous assistant response
+$a_t$.
If the user continues the task smoothly without restating
preferences, the response is treated as more rewarding; if
they restate preferences or express dissatisfaction, the
response is treated as less rewarding.
-\paragraph{LLM-as-judge reward model.}
-A classifier $C$ takes as input the triple
-$(q_t, a_t, q_{t+1})$---the user query, assistant response,
-and user follow-up---and classifies the follow-up into one of
-seven mutually exclusive labels:
+\paragraph{Keyword-based reward estimation.}
+The reward function $\hat{r}_t$ is computed from the user's
+follow-up $q_{t+1}$ using two lightweight signals:
%
-\begin{itemize}[nosep]
- \item \texttt{neg\_constraint\_restate}: user reasserts
- previously stated preferences ($r = {-}1.0$);
- \item \texttt{neg\_correction}: user indicates the response
- is incorrect ($r = {-}0.8$);
- \item \texttt{neg\_confusion}: user expresses confusion or
- requests re-explanation ($r = {-}0.6$);
- \item \texttt{pos\_praise}: explicit satisfaction or praise
- ($r = {+}0.8$);
- \item \texttt{pos\_progress}: constructive continuation such
- as follow-up questions or extensions ($r = {+}0.1$);
- \item \texttt{neutral}: ambiguous or minimal feedback
- ($r = 0$);
- \item \texttt{topic\_shift}: user switches to an unrelated
- topic (update skipped).
-\end{itemize}
+\begin{enumerate}[nosep]
+ \item \emph{Sentiment keywords.}
+ We maintain curated lists of negative indicators
+ (e.g., ``incorrect'', ``redo'', ``not what I asked'')
+ and positive indicators
+ (e.g., ``thanks'', ``continue'', ``what if'').
+ A negative match contributes $-1.0$ to the reward;
+ a positive match contributes up to $+1.0$.
+ \item \emph{Topic coherence.}
+ The cosine similarity between the query embeddings
+ $e_{q_t}$ and $e_{q_{t+1}}$ detects topic shifts.
+ When topic similarity is low ($< 0.2$), the reward is
+ dampened, since a topic change makes it ambiguous whether
+ the previous response was satisfactory or the user gave up.
+\end{enumerate}
%
-Each label is mapped to a fixed scalar reward
-$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$
-via a predefined reward map.
-The classifier also outputs a confidence score
-$c_t \in [0, 1]$.
+The final reward is clipped to $[-1, 1]$.
+This heuristic requires no additional model or API call,
+making it fast and suitable for large-scale evaluation.
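As an illustration, the two signals above can be sketched roughly as follows. All identifiers, the keyword lists, and the damping factor are assumptions for exposition, not the repository's actual code:

```python
# Sketch of the keyword-based reward heuristic: sentiment keyword
# matching on the follow-up q_{t+1}, dampened when the query
# embeddings indicate a topic shift.  Illustrative only.
import numpy as np

NEG_KEYWORDS = ["incorrect", "redo", "not what i asked"]  # example indicators
POS_KEYWORDS = ["thanks", "continue", "what if"]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_reward(q_next: str, e_q: np.ndarray, e_q_next: np.ndarray) -> float:
    """Heuristic reward r_hat in [-1, 1] from the user's follow-up."""
    text = q_next.lower()
    r = 0.0
    if any(k in text for k in NEG_KEYWORDS):
        r -= 1.0                          # negative match contributes -1.0
    if any(k in text for k in POS_KEYWORDS):
        r += 1.0                          # positive match contributes up to +1.0
    if cosine(e_q, e_q_next) < 0.2:       # low topic coherence: signal is ambiguous
        r *= 0.5                          # damping factor is an assumption
    return max(-1.0, min(1.0, r))         # clip to [-1, 1]
```

A follow-up like "this is incorrect" on-topic yields $-1.0$; the same follow-up after a topic shift is dampened toward zero.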
-\paragraph{Confidence gating.}
-We gate the update using the classifier's own confidence.
-Define
-\[
- g_t =
- \begin{cases}
- c_t & \text{if } c_t \geq \tau_c \text{ and label}
- \neq \texttt{topic\_shift}, \\
- 0 & \text{otherwise},
- \end{cases}
-\]
-where $\tau_c = 0.6$ is a confidence threshold.
-When $g_t = 0$ the reward is zeroed out and no user-vector
-update is performed.
-This prevents noisy or ambiguous signals from corrupting the
-learned user representation.
+\paragraph{Retrieval-attribution gating.}
+Not all feedback is informative for updating the retrieval
+policy: a negative reward caused by the LLM's generation
+quality (rather than poor memory retrieval) should not
+penalize the user vector.
+We therefore compute a gating scalar
+$g_t \in [0, 1]$ that estimates how much of the observed
+reward should be attributed to the retrieval component.
-In our experiments we instantiate $C$ with a
-Llama-3.1-8B-Instruct model served via vLLM, using
-structured JSON output.
-The judge receives only the $(q_t, a_t, q_{t+1})$ triple
-with no access to the user's preference profile or retrieved
-memories, ensuring that the reward signal reflects observable
-interaction quality rather than oracle information.
+Let $s^{\max}_q = \max_i \cos(e_{m_i}, e_{q_t})$ be the
+maximum cosine similarity between any retrieved memory
+embedding and the current query.
+The gating heuristic distinguishes three cases:
+%
+\begin{itemize}[nosep]
+  \item If the reward is strongly negative and no retrieved
+    memory is similar to the current query
+    ($s^{\max}_q < 0.2$), retrieval is likely at fault
+    (it failed to surface relevant preferences): $g_t = 0.9$.
+ \item If the reward is strongly negative but a relevant
+ memory was retrieved ($s^{\max}_q > 0.5$), the LLM
+ likely failed to follow the preference despite having
+ it in context: $g_t = 0.2$.
+ \item If the reward is positive and a relevant memory was
+ retrieved, retrieval contributed to the good outcome:
+ $g_t = 0.6$; otherwise $g_t = 0.3$.
+\end{itemize}
+%
+This gating mechanism ensures that the REINFORCE update
+primarily adjusts the user vector when the retrieval
+policy---rather than the frozen LLM---is plausibly
+responsible for the observed outcome.
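The three cases can be sketched as follows. The "strongly negative" threshold and the fallback value for the middle similarity band ($0.2 \le s^{\max}_q \le 0.5$) are assumptions the text leaves open:

```python
# Sketch of the retrieval-attribution gating heuristic.  Case values
# (0.9 / 0.2 / 0.6 / 0.3) follow the text; thresholds for "strongly
# negative" and the middle-band fallback are illustrative assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gating(reward: float, memory_embs, e_q: np.ndarray,
           neg_thresh: float = -0.5) -> float:
    """Return g_t in [0, 1]: share of reward attributed to retrieval."""
    # s_max_q: max similarity between any retrieved memory and the query
    s_max = max((cosine(m, e_q) for m in memory_embs), default=0.0)
    if reward < neg_thresh and s_max < 0.2:
        return 0.9   # nothing relevant retrieved: retrieval likely at fault
    if reward < neg_thresh and s_max > 0.5:
        return 0.2   # relevant memory was in context: LLM likely at fault
    if reward > 0 and s_max > 0.5:
        return 0.6   # retrieval contributed to the good outcome
    return 0.3       # otherwise (incl. ambiguous middle band)
```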
We maintain a running baseline $b_u$ for each user to approximate
their average reward:
@@ -120,9 +122,9 @@ We then define a retrieval-specific advantage:
\[
A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
-which down-weights updates when the classifier confidence is
-low (small $g_t$) or when the reward is close to the user's
-baseline performance.
+which down-weights updates when retrieval is unlikely to be
+responsible for the outcome (small $g_t$) or when the reward
+is close to the user's baseline performance.
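The running baseline and the retrieval-specific advantage can be sketched as below; the running-average coefficient and the order of the baseline update relative to the advantage computation are assumptions:

```python
# Sketch of the per-user running baseline b_u and the
# retrieval-specific advantage A_retr = g_t * (r_hat - b_u).
# beta (running-average coefficient) is an illustrative assumption.

def retrieval_advantage(r_hat: float, g_t: float, b_u: float,
                        beta: float = 0.9):
    """Return (A_retr_t, updated b_u) for one feedback step."""
    adv = g_t * (r_hat - b_u)              # gated, baseline-centered advantage
    b_u = beta * b_u + (1 - beta) * r_hat  # running average of the user's rewards
    return adv, b_u
```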
The user vectors are updated with a REINFORCE-style rule:
\[