# Revised LaTeX: Reward Modeling and Online User Adaptation
This section describes the keyword-based heuristic reward model (matching the actual experiments).
## Key design
1. **Reward**: keyword matching on $q_{t+1}$ (negative/positive) + topic coherence via embedding cosine similarity
2. **Gating**: separate heuristic based on memory-query similarity → attributes reward to retrieval vs generation
3. **No additional model**: lightweight, runs without GPU
4. **REINFORCE update**: unchanged
---
## Revised LaTeX
```latex
\subsection{Reward Modeling and Online User Adaptation}
\label{sec:method-rl}
We do not assume a specific type of feedback such as
thumbs-up/down.
Instead, we treat all available interaction signals as being
mapped into a \emph{scalar reward} that measures how helpful
the system was for the user at a given turn~\cite{williams1992simple}.
In our framework, this scalar reward is the only interface
between feedback and the user-vector update:
different applications can plug in different feedback sources
as long as they can be converted into a real-valued reward
(e.g., explicit ratings, binary ``good/bad'' labels, dwell time,
click-through, or task-completion signals).
\textbf{In this report, we instantiate this general interface
with a specific form of weak feedback available in our
MultiSessionCollab case study.}
Concretely, we observe the user's follow-up message $q_{t+1}$
to derive a scalar reward for the previous assistant response
$a_t$.
If the user continues the task smoothly without restating
preferences, the response is treated as more rewarding; if
they restate preferences or express dissatisfaction, the
response is treated as less rewarding.
\paragraph{Keyword-based reward estimation.}
The reward function $\hat{r}_t$ is computed from the user's
follow-up $q_{t+1}$ using two lightweight signals:
%
\begin{enumerate}[nosep]
\item \emph{Sentiment keywords.}
We maintain curated lists of negative indicators
(e.g., ``incorrect'', ``redo'', ``not what I asked'')
and positive indicators
(e.g., ``thanks'', ``continue'', ``what if'').
A negative match contributes $-1.0$ to the reward;
a positive match contributes up to $+1.0$.
\item \emph{Topic coherence.}
The cosine similarity between the query embeddings
$e_{q_t}$ and $e_{q_{t+1}}$ detects topic shifts.
When topic similarity is low ($< 0.2$), the reward is
dampened, since a topic change makes it ambiguous whether
the previous response was satisfactory or the user gave up.
\end{enumerate}
%
The final reward is clipped to $[-1, 1]$.
This heuristic requires no additional model or API call,
making it fast and suitable for large-scale evaluation.
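For concreteness, one possible formalization of this combination
(the precise weighting is an implementation choice) is
\[
\hat{r}_t
= \operatorname{clip}\bigl(c_t\, r^{\mathrm{kw}}_t,\; -1,\; 1\bigr),
\]
where $r^{\mathrm{kw}}_t$ sums the keyword contributions above and
$c_t \in [0, 1]$ is a damping factor equal to $1$ when
$\cos(e_{q_t}, e_{q_{t+1}}) \geq 0.2$ and smaller otherwise.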
\paragraph{Retrieval-attribution gating.}
Not all feedback is informative for updating the retrieval
policy: a negative reward caused by the LLM's generation
quality (rather than poor memory retrieval) should not
penalize the user vector.
We therefore compute a gating scalar
$g_t \in [0, 1]$ that estimates how much of the observed
reward should be attributed to the retrieval component.
Let $s^{\max}_q = \max_i \cos(e_{m_i}, e_{q_t})$ be the
maximum cosine similarity between any retrieved memory
embedding and the current query.
The gating heuristic distinguishes three cases:
%
\begin{itemize}[nosep]
\item If the reward is strongly negative and no memory is
      similar to the current query ($s^{\max}_q < 0.2$),
      retrieval is likely at fault (it failed to surface
      relevant preferences): $g_t = 0.9$.
\item If the reward is strongly negative but a relevant
memory was retrieved ($s^{\max}_q > 0.5$), the LLM
likely failed to follow the preference despite having
it in context: $g_t = 0.2$.
\item If the reward is positive and a relevant memory was
retrieved, retrieval contributed to the good outcome:
$g_t = 0.6$; otherwise $g_t = 0.3$.
\end{itemize}
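%
In compact form (rewards near zero fall into the final branch,
and the specific constants are design choices rather than tuned
values):
\[
g_t =
\begin{cases}
0.9, & \text{$\hat{r}_t$ strongly negative and } s^{\max}_q < 0.2,\\
0.2, & \text{$\hat{r}_t$ strongly negative and } s^{\max}_q > 0.5,\\
0.6, & \hat{r}_t > 0 \text{ and } s^{\max}_q > 0.5,\\
0.3, & \text{otherwise}.
\end{cases}
\]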
%
This gating mechanism ensures that the REINFORCE update
primarily adjusts the user vector when the retrieval
policy---rather than the frozen LLM---is plausibly
responsible for the observed outcome.
We maintain a running baseline $b_u$ for each user to approximate
their average reward:
\[
b_u \leftarrow (1 - \alpha) b_u + \alpha \hat{r}_t,
\]
with a small smoothing coefficient $\alpha$.
\paragraph{REINFORCE update.}
Let $A_t \subseteq M_t$ be the set of memories actually used at
time $t$ (e.g., the top-$k$ items under $s(\cdot)$).
We define the average item vector of chosen memories:
\[
v_{\text{chosen},t}
= \frac{1}{|A_t|}
\sum_{m_i \in A_t} v_{m_i},
\]
and the policy expectation in item space:
\[
\mu_t = \mathbb{E}_{m \sim \pi_z}[v_m]
= \sum_{i=1}^{K} \pi_z(m_i \mid q_t, u) v_{m_i}.
\]
We then define a retrieval-specific advantage:
\[
A^{\text{retr}}_t = g_t \,\bigl(\hat{r}_t - b_u\bigr),
\]
which down-weights the update when retrieval is unlikely to be
responsible for the outcome (small $g_t$) or when the reward is
close to the user's running baseline.
The user vectors are updated with a REINFORCE-style rule:
\[
\Delta z^{(L)}_u
= \eta_L \frac{A^{\text{retr}}_t}{\tau}\,
(v_{\text{chosen},t} - \mu_t),
\]
\[
\Delta z^{(S)}_{u,t}
= \eta_S \frac{A^{\text{retr}}_t}{\tau}\,
(v_{\text{chosen},t} - \mu_t),
\]
with learning rates $\eta_L,\eta_S$.
The short-term vector further undergoes exponential decay:
\[
z^{(S)}_{u,t+1}
= (1 - \lambda) z^{(S)}_{u,t} + \Delta z^{(S)}_{u,t},
\]
for some decay rate $\lambda \in (0, 1)$.
A positive advantage ($A^{\text{retr}}_t > 0$) moves the user
vectors toward the embeddings of the memories that were actually
used, while a negative advantage pushes them away.
Because the backbone LLM, embedding model, and reranker are
frozen, the entire personalization signal is captured by these
low-dimensional user vectors and the external preference memory.
The scalar-reward interface ensures that, in future applications,
we can swap in richer feedback sources (explicit or implicit)
without redesigning the update rule.
Intuitively, users who repeatedly make similar retrieval decisions
under comparable reward profiles accumulate updates in nearby
directions of the item space, in the spirit of classic
collaborative filtering~\cite{hu2008implicitcf,koren2009matrix},
causing their long-term vectors $z^{(L)}_u$ to cluster;
Section~\ref{sec:results} analyzes this emergent geometry and its
relationship to ground-truth preference overlap.
```
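
The per-turn loop above can be sketched in NumPy. This is a minimal illustration, not the experimental implementation: the keyword lists are abbreviated to the examples in the text, and the `-0.5` threshold for "strongly negative" and the `damp=0.5` topic-shift factor are assumed constants the section leaves unspecified.

```python
import numpy as np

# Abbreviated indicator lists (the paper's curated lists are larger).
NEG_WORDS = {"incorrect", "redo", "not what i asked"}
POS_WORDS = {"thanks", "continue", "what if"}

def keyword_reward(followup: str) -> float:
    """Sentiment-keyword score: -1.0 per negative hit, +1.0 per positive hit."""
    text = followup.lower()
    score = 0.0
    if any(w in text for w in NEG_WORDS):
        score -= 1.0
    if any(w in text for w in POS_WORDS):
        score += 1.0
    return score

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def estimate_reward(e_q_t, e_q_next, followup, damp=0.5, topic_thresh=0.2):
    """Keyword score, dampened on a topic shift, clipped to [-1, 1]."""
    r = keyword_reward(followup)
    if cos(e_q_t, e_q_next) < topic_thresh:  # topic change -> ambiguous signal
        r *= damp
    return float(np.clip(r, -1.0, 1.0))

def gating(r: float, s_max: float) -> float:
    """Attribute reward to retrieval vs. generation (three cases + default)."""
    if r < -0.5 and s_max < 0.2:
        return 0.9   # retrieval likely at fault: no relevant memory surfaced
    if r < -0.5 and s_max > 0.5:
        return 0.2   # LLM likely ignored a preference it had in context
    if r > 0 and s_max > 0.5:
        return 0.6   # retrieval contributed to the good outcome
    return 0.3

def reinforce_update(z, item_vecs, probs, chosen_idx, advantage, eta=0.1, tau=1.0):
    """REINFORCE-style step: move z along (v_chosen - mu) scaled by advantage."""
    v_chosen = item_vecs[chosen_idx].mean(axis=0)  # average of used memories
    mu = probs @ item_vecs                         # policy expectation in item space
    return z + eta * (advantage / tau) * (v_chosen - mu)
```

A single turn then chains these pieces: `r = estimate_reward(...)`, `g = gating(r, s_max)`, advantage `g * (r - b_u)`, then `reinforce_update` for both the long- and short-term vectors (with decay applied to the short-term one).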