Fix z_long definition to match code (zero-init + REINFORCE, not mean)

Paper incorrectly defined z_long as mean of item vectors. Code initializes z_long at zero and learns purely via REINFORCE. Also clarifies z_short reset-per-session behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
author: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 03:28:09 +0000
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 03:28:09 +0000
commit: dcc20b1f77702e5b45e2e6c08b0f243124c4676e (patch)
tree: 28a2a2c7d98202f4e93de0cdbc7412c38c9fec65
parent: 6a917d3eda85e5725c2d5ad3bf5ec9bd30262198 (diff)
1 files changed, 74 insertions, 0 deletions
diff --git a/docs/user_vector_section_fix.md b/docs/user_vector_section_fix.md
new file mode 100644
index 0000000..242ff6b
--- /dev/null
+++ b/docs/user_vector_section_fix.md
@@ -0,0 +1,74 @@
+# Fix: Section 3.5 User Vector Definition
+
+## Problem
+
+The paper defines $z^{(L)}_u$ as the mean of a user's item vectors, but the code initializes it to zero and learns it purely via REINFORCE.
+
+## Change
+
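+Replace the user-vector paragraph that contains the $z^{(L)}$ definition. The replacement also clarifies that $z^{(S)}$ is reset at the start of each session. Everything else in Section 3.5 stays the same.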
+
+---
+
+## Original (replace this)
+
+```latex
+\paragraph{Long-term and short-term user vectors.}
+For each user $u$, let $H(u)$ be the set of memory cards
+associated with $u$.
+We define the long-term user vector as the mean of their item
+vectors:
+\[
+  z^{(L)}_u = \frac{1}{|H(u)|} \sum_{m \in H(u)} v_m \in \mathbb{R}^k.
+\]
+This vector captures stable preferences across sessions, such as
+preferred language or typical level of detail.
+
+We also maintain a short-term vector $z^{(S)}_{u,t} \in \mathbb{R}^k$
+that captures session-specific drift and recency effects.
+At the start of a session, $z^{(S)}_{u,0}$ is initialized to $0$
+and updated online from feedback (Section~\ref{sec:method-rl}).
+The effective user vector at time $t$ is a convex combination
+\[
+  z^{\text{eff}}_{u,t} = \beta_L z^{(L)}_u + \beta_S z^{(S)}_{u,t},
+\]
+where $\beta_L, \beta_S \ge 0$ control the relative weight of
+long- and short-term preferences.
+In our style instantiation, $z^{(L)}_u$ encodes long-run style
+tendencies (e.g., always short/Chinese), while $z^{(S)}_{u,t}$
+can adapt to transient changes within a session.
+```
+
+## Replacement
+
+```latex
+\paragraph{Long-term and short-term user vectors.}
+For each user $u$, we maintain two learned vectors in the
+item space $\mathbb{R}^k$.
+
+The \emph{long-term vector} $z^{(L)}_u \in \mathbb{R}^k$ is
+initialized to zero and updated across sessions via the
+REINFORCE rule in Section~\ref{sec:method-rl}.
+Because it is never reset, $z^{(L)}_u$ accumulates gradient
+information from all past interactions and captures stable
+preferences such as preferred language or typical level of
+detail.
+
+The \emph{short-term vector} $z^{(S)}_{u,t} \in \mathbb{R}^k$
+captures session-specific context and recency effects.
+It is reset to zero at the start of each session and updated
+within the session from turn-level feedback, with an
+exponential decay that down-weights older signals
+(Section~\ref{sec:method-rl}).
+
+The effective user vector at time $t$ is a weighted combination
+\[
+  z^{\text{eff}}_{u,t}
+  = \beta_L\, z^{(L)}_u + \beta_S\, z^{(S)}_{u,t},
+\]
+where $\beta_L, \beta_S \ge 0$ control the relative influence
+of cross-session and within-session preferences.
+Both vectors are learned entirely from interaction feedback;
+no preference labels, pre-computed centroids, or explicit user
+features are required.
+```
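
For reviewers comparing the replacement text against the code, the behavior it describes can be sketched roughly as follows. This is a minimal illustration and not the repository's implementation: the class name `UserVectors`, the learning rates, the decay factor, and the exact form of the update are all assumptions, and the REINFORCE gradient estimate `grad` is taken as given because Section `sec:method-rl` is not part of this patch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserVectors:
    """Hypothetical container for the two per-user vectors in R^k."""

    k: int
    beta_long: float = 1.0   # beta_L in the paper (value assumed)
    beta_short: float = 1.0  # beta_S in the paper (value assumed)
    lr_long: float = 0.1     # assumed step size for z_long
    lr_short: float = 0.5    # assumed step size for z_short
    decay: float = 0.9       # assumed exponential decay on z_short
    z_long: List[float] = field(init=False)
    z_short: List[float] = field(init=False)

    def __post_init__(self) -> None:
        # Both vectors start at zero; no centroid of item vectors is used.
        self.z_long = [0.0] * self.k
        self.z_short = [0.0] * self.k

    def start_session(self) -> None:
        # Only the short-term vector is reset; z_long persists across sessions.
        self.z_short = [0.0] * self.k

    def update(self, grad: List[float]) -> None:
        # `grad` stands in for the REINFORCE gradient estimate (assumed given).
        self.z_long = [z + self.lr_long * g for z, g in zip(self.z_long, grad)]
        # Decaying the previous value down-weights older within-session signals.
        self.z_short = [
            self.decay * z + self.lr_short * g
            for z, g in zip(self.z_short, grad)
        ]

    def effective(self) -> List[float]:
        # z_eff = beta_L * z_long + beta_S * z_short
        return [
            self.beta_long * zl + self.beta_short * zs
            for zl, zs in zip(self.z_long, self.z_short)
        ]
```

Under these assumptions, calling `start_session()` after a few updates zeroes `z_short` while leaving `z_long` untouched, which is the distinction the replacement paragraph is meant to make explicit.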