# Fix: Section 3.5 User Vector Definition
## Problem
The paper defines $z^{(L)}_u$ as the mean of the user's item vectors, but the code initializes it to zero and learns it purely via REINFORCE updates.
## Change
Replace the $z^{(L)}$ definition paragraph. Everything else in Section 3.5 stays the same.
---
## Original (replace this)
```latex
\paragraph{Long-term and short-term user vectors.}
For each user $u$, let $H(u)$ be the set of memory cards
associated with $u$.
We define the long-term user vector as the mean of their item
vectors:
\[
z^{(L)}_u = \frac{1}{|H(u)|} \sum_{m \in H(u)} v_m \in \mathbb{R}^k.
\]
This vector captures stable preferences across sessions, such as
preferred language or typical level of detail.
We also maintain a short-term vector $z^{(S)}_{u,t} \in \mathbb{R}^k$
that captures session-specific drift and recency effects.
At the start of a session, $z^{(S)}_{u,0}$ is initialized to $0$
and updated online from feedback (Section~\ref{sec:method-rl}).
The effective user vector at time $t$ is a convex combination
\[
z^{\text{eff}}_{u,t} = \beta_L z^{(L)}_u + \beta_S z^{(S)}_{u,t},
\]
where $\beta_L, \beta_S \ge 0$ control the relative weight of
long- and short-term preferences.
In our style instantiation, $z^{(L)}_u$ encodes long-run style
tendencies (e.g., always short/Chinese), while $z^{(S)}_{u,t}$
can adapt to transient changes within a session.
```
## Replacement
```latex
\paragraph{Long-term and short-term user vectors.}
For each user $u$, we maintain two learned vectors in the
item space $\mathbb{R}^k$.
The \emph{long-term vector} $z^{(L)}_u \in \mathbb{R}^k$ is
initialized to zero and updated across sessions via the
REINFORCE rule in Section~\ref{sec:method-rl}.
Because it is never reset, $z^{(L)}_u$ accumulates gradient
information from all past interactions and captures stable
preferences such as preferred language or typical level of
detail.
The \emph{short-term vector} $z^{(S)}_{u,t} \in \mathbb{R}^k$
captures session-specific context and recency effects.
It is reset to zero at the start of each session and updated
within the session from turn-level feedback, with an
exponential decay that down-weights older signals
(Section~\ref{sec:method-rl}).
The effective user vector at time $t$ is a weighted combination
\[
z^{\text{eff}}_{u,t}
= \beta_L\, z^{(L)}_u + \beta_S\, z^{(S)}_{u,t},
\]
where $\beta_L, \beta_S \ge 0$ control the relative influence
of cross-session and within-session preferences.
Both vectors are learned entirely from interaction feedback;
no preference labels, pre-computed centroids, or explicit user
features are required.
```
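
For reviewers, the bookkeeping described in the replacement text can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class name `UserVectors`, the parameter names (`beta_long`, `beta_short`, `lr_long`, `lr_short`, `decay`), and the exact form of the update are assumptions. The actual REINFORCE rule lives in the section referenced as `sec:method-rl`, so here the gradient estimate is simply passed in by the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserVectors:
    """Hypothetical container for one user's long- and short-term vectors."""

    k: int                    # dimension of the item space R^k
    beta_long: float = 1.0    # weight beta_L (assumed value)
    beta_short: float = 1.0   # weight beta_S (assumed value)
    lr_long: float = 0.01     # step size for z^(L) (assumed)
    lr_short: float = 0.1     # step size for z^(S) (assumed)
    decay: float = 0.9        # exponential decay on z^(S) (assumed)
    z_long: List[float] = field(init=False)
    z_short: List[float] = field(init=False)

    def __post_init__(self) -> None:
        # Both vectors start at zero; nothing is derived from item centroids.
        self.z_long = [0.0] * self.k
        self.z_short = [0.0] * self.k

    def start_session(self) -> None:
        # Only the short-term vector is reset; z_long persists across sessions.
        self.z_short = [0.0] * self.k

    def effective(self) -> List[float]:
        # z_eff = beta_L * z_long + beta_S * z_short
        return [
            self.beta_long * zl + self.beta_short * zs
            for zl, zs in zip(self.z_long, self.z_short)
        ]

    def update(self, grad: List[float]) -> None:
        # `grad` stands in for a REINFORCE-style gradient estimate computed
        # elsewhere; the real rule may differ from this additive step.
        self.z_long = [zl + self.lr_long * g for zl, g in zip(self.z_long, grad)]
        # Decay the old short-term signal before adding the new one.
        self.z_short = [
            self.decay * zs + self.lr_short * g
            for zs, g in zip(self.z_short, grad)
        ]


if __name__ == "__main__":
    user = UserVectors(k=2)
    user.update([1.0, -2.0])   # feedback within a session
    print(user.effective())
    user.start_session()       # new session: short-term vector is cleared
    print(user.effective())
```

The point of the sketch is the asymmetry: `start_session` clears only `z_short`, so `z_long` is the sole carrier of cross-session information, which matches the "never reset" wording in the replacement paragraph.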