-rw-r--r--  docs/reward_section_revised.md  162
1 file changed, 162 insertions(+), 0 deletions(-)
diff --git a/docs/reward_section_revised.md b/docs/reward_section_revised.md
new file mode 100644
index 0000000..3dd259a
--- /dev/null
+++ b/docs/reward_section_revised.md
@@ -0,0 +1,162 @@
+# Revised LaTeX: Reward Modeling and Online User Adaptation
+
+Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.
+
+## Changes from original
+
+1. **Input**: Removed $A_t$ from reward model input — code only uses $(q_t, a_t, q_{t+1})$
+2. **Classifier**: Replaced abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with single 7-label LLM-as-judge classifier + fixed reward mapping
+3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
+4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
+5. **REINFORCE section**: Kept as-is (already correct)
+
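For reference, the label set, fixed reward map, and confidence gate described in changes 2–3 can be sketched in Python. This is a minimal sketch, not the actual implementation: the function name `gated_reward` and constant names are illustrative, while the label names, reward values, and $\tau_c = 0.6$ are taken from the LaTeX below.

```python
# Fixed label -> reward map for the 7-label LLM-as-judge classifier.
# Values follow the revised LaTeX section; "topic_shift" maps to None
# because the update is skipped entirely for that label.
REWARD_MAP = {
    "neg_constraint_restate": -1.0,
    "neg_correction": -0.8,
    "neg_confusion": -0.6,
    "pos_praise": 0.8,
    "pos_progress": 0.1,
    "neutral": 0.0,
    "topic_shift": None,  # update skipped
}

TAU_C = 0.6  # confidence threshold tau_c from the gating rule


def gated_reward(label: str, confidence: float):
    """Return (reward r_hat, gate g_t); g_t = 0 suppresses the update."""
    r = REWARD_MAP[label]
    if r is None or confidence < TAU_C:
        return 0.0, 0.0
    return r, confidence
```

For example, `gated_reward("neg_correction", 0.9)` returns `(-0.8, 0.9)`, while any label judged below `TAU_C`, or a `topic_shift`, yields `(0.0, 0.0)` and the update is skipped.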
+---
+
+## Revised LaTeX
+
+```latex
+\subsection{Reward Modeling and Online User Adaptation}
+\label{sec:method-rl}
+
+We do not assume a specific type of feedback such as
+thumbs-up/down.
+Instead, we treat all available interaction signals as being
+mapped into a \emph{scalar reward} that measures how helpful
+the system was for the user at a given turn~\cite{williams1992simple}.
+In our framework, this scalar reward is the only interface
+between feedback and the user-vector update:
+different applications can plug in different feedback sources
+as long as they can be converted into a real-valued reward
+(e.g., explicit ratings, binary ``good/bad'' labels, dwell time,
+click-through, or task-completion signals).
+
+\textbf{In this report, we instantiate this general interface
+with a specific form of weak feedback available in our
+MultiSessionCollab case study.}
+Concretely, we use an automated LLM judge that observes the
+user's follow-up message $q_{t+1}$ to derive a scalar reward
+for the previous assistant response $a_t$.
+If the user continues the task smoothly without restating
+preferences, the response is treated as more rewarding; if
+they restate preferences or express dissatisfaction, the
+response is treated as less rewarding.
+
+\paragraph{LLM-as-judge reward model.}
+A classifier $C$ takes as input the triple
+$(q_t, a_t, q_{t+1})$---the user query, assistant response,
+and user follow-up---and classifies the follow-up into one of
+seven mutually exclusive labels:
+%
+\begin{itemize}[nosep]
+ \item \texttt{neg\_constraint\_restate}: user reasserts
+ previously stated preferences ($r = {-}1.0$);
+ \item \texttt{neg\_correction}: user indicates the response
+ is incorrect ($r = {-}0.8$);
+ \item \texttt{neg\_confusion}: user expresses confusion or
+ requests re-explanation ($r = {-}0.6$);
+ \item \texttt{pos\_praise}: explicit satisfaction or praise
+ ($r = {+}0.8$);
+ \item \texttt{pos\_progress}: constructive continuation such
+ as follow-up questions or extensions ($r = {+}0.1$);
+ \item \texttt{neutral}: ambiguous or minimal feedback
+ ($r = 0$);
+ \item \texttt{topic\_shift}: user switches to an unrelated
+ topic (update skipped).
+\end{itemize}
+%
+Each label is mapped to a fixed scalar reward
+$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$
+via a predefined reward map.
+The classifier also outputs a confidence score
+$c_t \in [0, 1]$.
+
+\paragraph{Confidence gating.}
+We gate the update using the classifier's own confidence.
+Define
+\[
+ g_t =
+ \begin{cases}
+    c_t & \text{if } c_t \geq \tau_c \text{ and the label is not } \texttt{topic\_shift}, \\
+ 0 & \text{otherwise},
+ \end{cases}
+\]
+where $\tau_c = 0.6$ is a confidence threshold.
+When $g_t = 0$ the reward is zeroed out and no user-vector
+update is performed.
+This prevents noisy or ambiguous signals from corrupting the
+learned user representation.
+
+In our experiments we instantiate $C$ with a
+Llama-3.1-8B-Instruct model served via vLLM, using
+structured JSON output.
+The judge receives only the $(q_t, a_t, q_{t+1})$ triple
+with no access to the user's preference profile or retrieved
+memories, ensuring that the reward signal reflects observable
+interaction quality rather than oracle information.
+
+We maintain a running baseline $b_u$ for each user to approximate
+their average reward:
+\[
+ b_u \leftarrow (1 - \alpha) b_u + \alpha \hat{r}_t,
+\]
+with a small smoothing coefficient $\alpha$.
+
+\paragraph{REINFORCE update.}
+Let $A_t \subseteq M_t$ be the set of memories actually used at
+time $t$ (e.g., the top-$k$ items under $s(\cdot)$).
+We define the average item vector of chosen memories:
+\[
+ v_{\text{chosen},t}
+ = \frac{1}{|A_t|}
+ \sum_{m_i \in A_t} v_{m_i},
+\]
+and the policy expectation in item space:
+\[
+ \mu_t = \mathbb{E}_{m \sim \pi_z}[v_m]
+ = \sum_{i=1}^{K} \pi_z(m_i \mid q_t, u) v_{m_i}.
+\]
+We then define a retrieval-specific advantage:
+\[
+ A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
+\]
+which down-weights updates when the classifier confidence is
+low (small $g_t$) or when the reward is close to the user's
+baseline performance.
+
+The user vectors are updated with a REINFORCE-style rule:
+\[
+ \Delta z^{(L)}_u
+ = \eta_L \frac{A^{\text{retr}}_t}{\tau}\,
+ (v_{\text{chosen},t} - \mu_t),
+\]
+\[
+ \Delta z^{(S)}_{u,t}
+ = \eta_S \frac{A^{\text{retr}}_t}{\tau}\,
+ (v_{\text{chosen},t} - \mu_t),
+\]
+with learning rates $\eta_L,\eta_S$.
+The short-term vector further undergoes exponential decay:
+\[
+ z^{(S)}_{u,t+1}
+ = (1 - \lambda) z^{(S)}_{u,t} + \Delta z^{(S)}_{u,t},
+\]
+for some decay rate $\lambda \in (0, 1)$.
+Positive advantage ($A^{\text{retr}}_t > 0$) moves the user
+vectors toward the directions of the actually used memories,
+while negative advantage pushes them away.
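+% Worked one-step example; the numbers are illustrative, not
+% taken from our experiments.
+For example, if the judge labels the follow-up
+\texttt{neg\_correction} with confidence $c_t = 0.9$ while the
+user's running baseline is $b_u = 0.1$, then $\hat{r}_t = -0.8$,
+$g_t = 0.9$, and
+$A^{\text{retr}}_t = 0.9\,(-0.8 - 0.1) = -0.81 < 0$,
+so both vectors take a step away from
+$v_{\text{chosen},t} - \mu_t$;
+with confidence $c_t = 0.5 < \tau_c$ instead, $g_t = 0$ and the
+step would be skipped entirely.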
+Because the backbone LLM, embedding model, and reranker are
+frozen, the entire personalization signal is captured by these
+low-dimensional user vectors and the external preference memory.
+The scalar-reward interface ensures that, in future applications,
+we can swap in richer feedback sources (explicit or implicit)
+without redesigning the update rule.
+
+Intuitively, users who repeatedly make similar retrieval decisions
+under similar reward profiles accumulate updates along nearby
+directions of the item space, in the spirit of classic
+collaborative filtering~\cite{hu2008implicitcf,koren2009matrix},
+causing their long-term vectors $z^{(L)}_u$ to cluster;
+Section~\ref{sec:results} analyzes this emergent geometry and its
+relationship to ground-truth preference overlap.
+```