# Revised LaTeX: Reward Modeling and Online User Adaptation
Below is the revised `\subsection{Reward Modeling and Online User Adaptation}` that matches the actual code implementation.
## Changes from original
1. **Input**: Removed $A_t$ from reward model input — code only uses $(q_t, a_t, q_{t+1})$
2. **Classifier**: Replaced abstract $C_{\text{reward}}$ / $C_{\text{gate}}$ with single 7-label LLM-as-judge classifier + fixed reward mapping
3. **Gating**: Changed from "memory attribution" to confidence-based thresholding ($g_t = c_t$ when $c_t \geq \tau_c$, else 0)
4. **Model**: Explicitly describes Llama-3.1-8B-Instruct as the judge model
5. **REINFORCE section**: Kept as-is (already correct)
---
## Revised LaTeX
```latex
\subsection{Reward Modeling and Online User Adaptation}
\label{sec:method-rl}
We do not assume a specific type of feedback such as
thumbs-up/down.
Instead, we map all available interaction signals into a
\emph{scalar reward} that measures how helpful
the system was to the user at a given turn~\cite{williams1992simple}.
In our framework, this scalar reward is the only interface
between feedback and the user-vector update:
different applications can plug in different feedback sources
as long as they can be converted into a real-valued reward
(e.g., explicit ratings, binary ``good/bad'' labels, dwell time,
click-through, or task-completion signals).
\textbf{In this report, we instantiate this general interface
with a specific form of weak feedback available in our
MultiSessionCollab case study.}
Concretely, we use an automated LLM judge that observes the
user's follow-up message $q_{t+1}$ to derive a scalar reward
for the previous assistant response $a_t$.
If the user continues the task smoothly without restating
preferences, the response is treated as more rewarding; if
they restate preferences or express dissatisfaction, the
response is treated as less rewarding.
\paragraph{LLM-as-judge reward model.}
A classifier $C$ takes as input the triple
$(q_t, a_t, q_{t+1})$---the user query, assistant response,
and user follow-up---and classifies the follow-up into one of
seven mutually exclusive labels:
%
\begin{itemize}[nosep]
\item \texttt{neg\_constraint\_restate}: user reasserts
previously stated preferences ($r = {-}1.0$);
\item \texttt{neg\_correction}: user indicates the response
is incorrect ($r = {-}0.8$);
\item \texttt{neg\_confusion}: user expresses confusion or
requests re-explanation ($r = {-}0.6$);
\item \texttt{pos\_praise}: explicit satisfaction or praise
($r = {+}0.8$);
\item \texttt{pos\_progress}: constructive continuation such
as follow-up questions or extensions ($r = {+}0.1$);
\item \texttt{neutral}: ambiguous or minimal feedback
($r = 0$);
\item \texttt{topic\_shift}: user switches to an unrelated
topic (update skipped).
\end{itemize}
%
Each label is mapped to a fixed scalar reward
$\hat{r}_t \in \{-1.0, -0.8, -0.6, 0, +0.1, +0.8\}$
via a predefined reward map.
The classifier also outputs a confidence score
$c_t \in [0, 1]$.
\paragraph{Confidence gating.}
We gate the update using the classifier's own confidence.
Define
\[
g_t =
\begin{cases}
c_t & \text{if } c_t \geq \tau_c \text{ and label}
\neq \texttt{topic\_shift}, \\
0 & \text{otherwise},
\end{cases}
\]
where $\tau_c = 0.6$ is a confidence threshold.
When $g_t = 0$ the reward is zeroed out and no user-vector
update is performed.
This prevents noisy or ambiguous signals from corrupting the
learned user representation.
\paragraph{Judge instantiation.}
In our experiments we instantiate $C$ with a
Llama-3.1-8B-Instruct model served via vLLM, using
structured JSON output.
The judge receives only the $(q_t, a_t, q_{t+1})$ triple
with no access to the user's preference profile or retrieved
memories, ensuring that the reward signal reflects observable
interaction quality rather than oracle information.
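For concreteness, one plausible shape of the judge's structured
output (the field names here are an illustrative assumption, not a
committed schema) is:
\begin{verbatim}
{"label": "neg_correction", "confidence": 0.82}
\end{verbatim}
which the reward map converts to $\hat{r}_t = -0.8$ with
$c_t = 0.82$, passing the $\tau_c = 0.6$ gate.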
\paragraph{Reward baseline.}
We maintain a running baseline $b_u$ for each user to approximate
their average reward:
\[
b_u \leftarrow (1 - \alpha) b_u + \alpha \hat{r}_t,
\]
with a small smoothing coefficient $\alpha$.
\paragraph{REINFORCE update.}
Let $A_t \subseteq M_t$ be the set of memories actually used at
time $t$ (e.g., the top-$k$ items under $s(\cdot)$).
We define the average item vector of chosen memories:
\[
v_{\text{chosen},t}
= \frac{1}{|A_t|}
\sum_{m_i \in A_t} v_{m_i},
\]
and the policy expectation in item space:
\[
\mu_t = \mathbb{E}_{m \sim \pi_z}[v_m]
= \sum_{i=1}^{K} \pi_z(m_i \mid q_t, u) v_{m_i}.
\]
We then define a retrieval-specific advantage:
\[
A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
which down-weights updates when the classifier confidence is
low (small $g_t$) or when the reward is close to the user's
baseline performance.
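As an illustrative example (with made-up numbers rather than values
from our experiments): a confident \texttt{neg\_constraint\_restate}
judgment with $c_t = 0.9$ against a slightly positive baseline
$b_u = 0.1$ gives
$A^{\text{retr}}_t = 0.9\,(-1.0 - 0.1) = -0.99$,
a strong push away from the memories used, whereas
\texttt{pos\_progress} at the same confidence gives
$A^{\text{retr}}_t = 0.9\,(0.1 - 0.1) = 0$,
leaving the user vectors unchanged.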
The user vectors are updated with a REINFORCE-style rule:
\[
\Delta z^{(L)}_u
= \eta_L \frac{A^{\text{retr}}_t}{\tau}\,
(v_{\text{chosen},t} - \mu_t),
\]
\[
\Delta z^{(S)}_{u,t}
= \eta_S \frac{A^{\text{retr}}_t}{\tau}\,
(v_{\text{chosen},t} - \mu_t),
\]
with learning rates $\eta_L,\eta_S$.
The short-term vector further undergoes exponential decay:
\[
z^{(S)}_{u,t+1}
= (1 - \lambda) z^{(S)}_{u,t} + \Delta z^{(S)}_{u,t},
\]
for some decay rate $\lambda \in (0, 1)$.
A positive advantage ($A^{\text{retr}}_t > 0$) moves the user
vectors toward the vectors of the memories actually used,
while a negative advantage pushes them away.
Because the backbone LLM, embedding model, and reranker are
frozen, the entire personalization signal is captured by these
low-dimensional user vectors and the external preference memory.
The scalar-reward interface ensures that, in future applications,
we can swap in richer feedback sources (explicit or implicit)
without redesigning the update rule.
Intuitively, users who repeatedly make similar retrieval decisions
under similar reward profiles will accumulate updates in similar
directions in the item space, similar in spirit to classic
collaborative filtering~\cite{hu2008implicitcf,koren2009matrix},
causing their long-term vectors $z^{(L)}_u$ to cluster;
Section~\ref{sec:results} analyzes this emergent geometry and its
relationship to ground-truth preference overlap.
```
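
## Reference sketch (Python)

The label-to-reward map, confidence gating, and REINFORCE-style update above can be sketched in Python roughly as follows. The hyperparameter values and the softmax parameterization of $\pi_z$ over item vectors are assumptions for illustration, and the LLM judge call is omitted (it would supply `label` and `confidence`); this is a sketch, not the implementation.

```python
import numpy as np

# Fixed label -> reward map from the subsection above.
# topic_shift is absent on purpose: it skips the update entirely.
REWARD_MAP = {
    "neg_constraint_restate": -1.0,
    "neg_correction": -0.8,
    "neg_confusion": -0.6,
    "pos_praise": 0.8,
    "pos_progress": 0.1,
    "neutral": 0.0,
}

TAU_C = 0.6               # confidence threshold tau_c (from the text)
ALPHA = 0.1               # baseline smoothing alpha (assumed value)
ETA_L, ETA_S = 0.05, 0.2  # learning rates eta_L, eta_S (assumed values)
TAU = 1.0                 # softmax temperature tau (assumed value)
LAM = 0.05                # short-term decay lambda (assumed value)


def gate(label: str, confidence: float) -> float:
    """g_t = c_t if c_t >= tau_c and the label is not topic_shift, else 0."""
    if label == "topic_shift" or confidence < TAU_C:
        return 0.0
    return confidence


def reinforce_update(z_long, z_short, b_u, label, confidence,
                     item_vecs, chosen_idx):
    """One online user-vector update from a judged turn.

    item_vecs: (K, d) array of memory-item vectors v_{m_i};
    chosen_idx: indices of the memories in A_t actually used.
    Returns the updated (z_long, z_short, b_u).
    """
    g = gate(label, confidence)
    if g == 0.0:
        return z_long, z_short, b_u          # gated out: no update

    r = REWARD_MAP[label]                    # scalar reward r_hat_t
    z = z_long + z_short                     # combined user vector
    scores = item_vecs @ z / TAU
    pi = np.exp(scores - scores.max())       # softmax policy pi_z
    pi /= pi.sum()

    mu = pi @ item_vecs                      # policy expectation mu_t
    v_chosen = item_vecs[chosen_idx].mean(axis=0)
    adv = g * (r - b_u)                      # advantage A^retr_t

    z_long = z_long + ETA_L * adv / TAU * (v_chosen - mu)
    z_short = (1 - LAM) * z_short + ETA_S * adv / TAU * (v_chosen - mu)
    b_u = (1 - ALPHA) * b_u + ALPHA * r      # running baseline b_u
    return z_long, z_short, b_u
```

A reward above the user's baseline moves both user vectors toward the mean of the chosen item vectors and away from the policy expectation, mirroring the $\Delta z^{(L)}_u$ and $\Delta z^{(S)}_{u,t}$ equations above.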