# Fix: Section 3.5 User Vector Definition

## Problem

The paper defines $z^{(L)}_u$ as the mean of the user's item vectors, but the code initializes it to zero and learns it purely via REINFORCE.

## Change

Replace the entire `\paragraph{Long-term and short-term user vectors.}` block, which contains the $z^{(L)}_u$ definition. Everything else in Section 3.5 stays the same.

---

## Original (replace this)

```latex
\paragraph{Long-term and short-term user vectors.}
For each user $u$, let $H(u)$ be the set of memory cards
associated with $u$.
We define the long-term user vector as the mean of their item
vectors:
\[
  z^{(L)}_u = \frac{1}{|H(u)|} \sum_{m \in H(u)} v_m \in \mathbb{R}^k.
\]
This vector captures stable preferences across sessions, such as
preferred language or typical level of detail.

We also maintain a short-term vector $z^{(S)}_{u,t} \in \mathbb{R}^k$
that captures session-specific drift and recency effects.
At the start of a session, $z^{(S)}_{u,0}$ is initialized to $0$
and updated online from feedback (Section~\ref{sec:method-rl}).
The effective user vector at time $t$ is a convex combination
\[
  z^{\text{eff}}_{u,t} = \beta_L z^{(L)}_u + \beta_S z^{(S)}_{u,t},
\]
where $\beta_L, \beta_S \ge 0$ control the relative weight of
long- and short-term preferences.
In our style instantiation, $z^{(L)}_u$ encodes long-run style
tendencies (e.g., always short/Chinese), while $z^{(S)}_{u,t}$
can adapt to transient changes within a session.
```

## Replacement

```latex
\paragraph{Long-term and short-term user vectors.}
For each user $u$, we maintain two learned vectors in the
item space $\mathbb{R}^k$.

The \emph{long-term vector} $z^{(L)}_u \in \mathbb{R}^k$ is
initialized to zero and updated across sessions via the
REINFORCE rule in Section~\ref{sec:method-rl}.
Because it is never reset, $z^{(L)}_u$ accumulates gradient
information from all past interactions and captures stable
preferences such as preferred language or typical level of
detail.

The \emph{short-term vector} $z^{(S)}_{u,t} \in \mathbb{R}^k$
captures session-specific context and recency effects.
It is reset to zero at the start of each session and updated
within the session from turn-level feedback, with an
exponential decay that down-weights older signals
(Section~\ref{sec:method-rl}).

The effective user vector at time $t$ is a weighted combination
\[
  z^{\text{eff}}_{u,t}
  = \beta_L\, z^{(L)}_u + \beta_S\, z^{(S)}_{u,t},
\]
where $\beta_L, \beta_S \ge 0$ control the relative influence
of cross-session and within-session preferences.
Both vectors are learned entirely from interaction feedback;
no preference labels, pre-computed centroids, or explicit user
features are required.
```
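
For reviewers checking the replacement text against the code, the state handling it describes can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class name `UserVectors`, the step sizes `lr_long` and `lr_short`, the `decay` factor, and the exact form of the update are all assumptions, and the REINFORCE gradient estimate `grad` is taken as an input because its definition lives in the RL section, not here.

```python
from typing import List

Vector = List[float]


class UserVectors:
    """Long- and short-term vectors for one user (illustrative sketch only)."""

    def __init__(self, k: int, beta_long: float = 1.0, beta_short: float = 1.0,
                 lr_long: float = 0.01, lr_short: float = 0.1,
                 decay: float = 0.9) -> None:
        self.k = k
        self.beta_long = beta_long    # beta_L in the paper
        self.beta_short = beta_short  # beta_S in the paper
        self.lr_long = lr_long        # hypothetical step size for z^(L)
        self.lr_short = lr_short      # hypothetical step size for z^(S)
        self.decay = decay            # hypothetical decay applied to z^(S)
        # Both vectors start at zero; no item-vector mean is computed.
        self.z_long: Vector = [0.0] * k
        self.z_short: Vector = [0.0] * k

    def start_session(self) -> None:
        """Reset only the short-term vector; the long-term vector persists."""
        self.z_short = [0.0] * self.k

    def update(self, grad: Vector) -> None:
        """Apply one turn-level update from a gradient estimate.

        `grad` stands in for the REINFORCE gradient with respect to the
        effective user vector; how it is computed is outside this sketch.
        """
        self.z_long = [z + self.lr_long * g
                       for z, g in zip(self.z_long, grad)]
        # Decaying the old value down-weights older within-session signals.
        self.z_short = [self.decay * z + self.lr_short * g
                        for z, g in zip(self.z_short, grad)]

    def effective(self) -> Vector:
        """Return beta_L * z^(L) + beta_S * z^(S)."""
        return [self.beta_long * zl + self.beta_short * zs
                for zl, zs in zip(self.z_long, self.z_short)]
```

The point of the sketch is the lifecycle: `z_long` is never reset, `z_short` is zeroed by `start_session()`, and neither vector is derived from $H(u)$ or the item vectors $v_m$.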