diff --git a/docs/discussion_revised.md b/docs/discussion_revised.md
new file mode 100644
index 0000000..5d02dc6
--- /dev/null
+++ b/docs/discussion_revised.md
@@ -0,0 +1,222 @@
+# Revised LaTeX: Discussion & Limitations
+
+## Changes from original
+
+1. All numbers updated to match 60 profiles × 60 sessions × 3 datasets
+2. Removed duplicate "Beyond the aggregate table" paragraph
+3. Rewrote Contextual vs RAG analysis (old narrative no longer holds)
+4. Updated user-vector analysis (ρ=0.040, top-/bottom-quartile contrast, dual-vector separation)
+5. Removed old variance analysis (three runs on 5 profiles)
+6. Added E/T decomposition discussion
+7. Added RAG timeout analysis (44.3% → 26.4% with vector)
+8. Updated Limitations (60 profiles, 60 sessions, keyword reward, style-only)
+9. Updated future directions (ablation studies underway, LLM judge available)
+
+---
+
+## Revised LaTeX
+
+```latex
+\section{Discussion \& Limitations}
+\label{sec:discussion}
+
+\paragraph{Summary of findings.}
+Our experiments show that user modeling matters for
+multi-session personalization, but the benefits are
+nuanced.
+On the three-domain evaluation ($60$~profiles,
+$60$~sessions), RAG+Vector achieves the highest task
+success rate ($55.2\%$), but the absolute gain over
+Reflection ($54.4\%$) is small and not statistically
+significant ($p = 0.276$).
+The more consistent improvements appear on
+interaction-efficiency metrics: RAG+Vector significantly
+reduces timeout rate ($-2.4$~pp, $p = 0.046$) and user
+effort ($-13.9$ tokens, $p = 0.021$) compared to
+Reflection, while yielding the highest success-per-token
+efficiency ($2.83$ per $1{,}000$ user tokens, $+8.4\%$
+vs.\ Reflection).
+These patterns suggest that a lightweight, frozen-backbone
+user model primarily improves \emph{how efficiently} the
+agent interacts rather than \emph{how often} it ultimately
+succeeds.
+
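+% Added sketch: an explicit form for the success-per-token metric
+% quoted above; this estimator is an assumption, not taken verbatim
+% from the implementation.
+For reference, the efficiency metric can be written as
+\begin{equation}
+\mathrm{Eff}(m) = \frac{\mathrm{SR}(m)}{\bar{T}_{\mathrm{user}}(m)} \times 1000,
+\end{equation}
+where $\mathrm{SR}(m)$ is method $m$'s task success rate and
+$\bar{T}_{\mathrm{user}}(m)$ its mean user tokens per session;
+under this reading, $0.552/\bar{T}_{\mathrm{user}} \times 1000 = 2.83$
+implies roughly $195$ user tokens per session for RAG+Vector.
+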
+\paragraph{The role of the user vector.}
+Comparing RAG ($52.0\%$) with RAG+Vector ($55.2\%$)
+isolates the contribution of the learned user vector,
+since the two methods share identical preference extraction,
+embedding, and reranking components.
+The $+3.2$~pp success gain is accompanied by a dramatic
+reduction in timeout rate ($44.3\% \to 26.4\%$), suggesting
+that the user vector helps the retrieval policy prioritize
+the most relevant preferences and avoid flooding the agent
+with tangential or conflicting memory cards.
+Without the vector, the RAG method suffers from
+\emph{preference overload}: as the memory grows over
+sessions, the retrieval system surfaces an increasingly
+noisy set of preferences, which leads the agent to either
+hedge (by asking clarification questions) or produce
+unfocused responses.
+The user vector mitigates this by biasing retrieval toward
+preferences that have historically led to positive user
+feedback for this particular user.
+
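+% Added sketch (assumption): an illustrative form of the biasing
+% mechanism described above, reusing the scoring weights $\beta_L$,
+% $\beta_S$ from the hyperparameter list; the actual scoring
+% function in the system may differ.
+One way to realize this bias is an augmented retrieval score for a
+memory card $c$ with embedding $e_c$:
+\begin{equation}
+s(c) = s_{\mathrm{rerank}}(c)
++ \beta_L \cos\bigl(e_c, z^{(L)}_u\bigr)
++ \beta_S \cos\bigl(e_c, z^{(S)}_u\bigr),
+\end{equation}
+so that cards aligned with the user's learned vectors are surfaced
+first and tangential cards fall below the top-$k$ cutoff.
+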
+\paragraph{Comparison with Reflection.}
+The Reflection baseline generates a reasoning-heavy
+session-level summary after each conversation and appends
+it to future prompts, providing an implicit form of
+cross-session memory.
+RAG+Vector matches Reflection in task success while using
+$6.7\%$ fewer user tokens per session.
+This efficiency advantage is practically meaningful in
+deployment scenarios where user patience and token budgets
+are limited.
+Under a frozen backbone, a relatively cheap combination of
+preference memory and user vectors can match the performance
+of a reflection-style agent that incurs an extra LLM pass
+per session.
+In our implementation, the Reflection baseline uses prompted
+reflections without GRPO fine-tuning; reproducing the
+GRPO-trained reflection agents from the original
+\textsc{MultiSessionCollab} paper remains future work and
+would likely raise the Reflection ceiling further.
+
+\paragraph{Why does Vanilla perform well?}
+A perhaps surprising finding is that Vanilla ($54.3\%$)
+nearly matches Reflection ($54.4\%$), despite having no
+cross-session memory at all.
+This reflects the difficulty of the evaluation setting:
+with three diverse task domains and $43$ style preferences
+per profile, even a memory-augmented agent cannot reliably
+satisfy all preferences, and the user simulator's preference
+enforcement behavior introduces substantial variance.
+The result also suggests that a meaningful portion of task
+success is driven by the agent's intrinsic problem-solving
+ability rather than preference compliance, and that current
+personalization methods yield incremental rather than
+transformative gains on this benchmark.
+
+\paragraph{Enforcement rate and interaction dynamics.}
+RAG+Vector exhibits a slightly higher enforcement-per-turn
+rate ($0.186$) than Reflection ($0.175$), which might
+initially appear counterproductive.
+However, decomposing this metric reveals that $79\%$ of the
+gap comes from a marginal increase in total enforcements per
+session ($1.54$ vs.\ $1.47$, $p = 0.058$); the remaining
+$21\%$ reflects the fact that RAG+Vector completes sessions
+in fewer turns ($8.31$ vs.\ $8.41$), which mechanically
+raises the per-turn rate.
+Since RAG+Vector achieves comparable success with fewer
+turns and significantly lower user token cost, the slightly
+elevated enforcement rate does not translate into worse
+interaction quality.
+We hypothesize that RAG's explicit preference injection
+prompts the user simulator to give more targeted,
+preference-specific feedback, whereas Reflection's
+summarized notes lead to broader but less pointed
+enforcement signals.
+
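+% Added: a worked form of the decomposition, assuming a standard
+% log-ratio split of the per-turn rate $r = E/T$.
+Writing the per-turn rate as $r = E/T$, with $E$ total enforcements
+and $T$ turns per session, the gap splits as
+\begin{equation}
+\Delta \log r = \log\frac{1.54}{1.47} - \log\frac{8.31}{8.41}
+\approx 0.047 + 0.012 = 0.059,
+\end{equation}
+of which the enforcement term accounts for
+$0.047/0.059 \approx 79\%$.
+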
+\paragraph{User-vector geometry.}
+Our analysis of the learned long-term vectors $z^{(L)}_u$
+reveals that they encode preference-aware structure rather
+than random noise.
+Users with more similar revealed preferences develop more
+similar long-term vectors: the top-$25\%$ of user pairs by
+preference Jaccard overlap have a mean vector cosine of
+$+0.012$, versus $-0.027$ for the bottom-$25\%$
+($p = 0.021$, Mann--Whitney $U$).
+The dual-vector design successfully separates stable
+preference identity from session-specific context:
+$z^{(L)}$ significantly associates with long-term
+preference overlap ($p = 0.006$), while $z^{(S)}$ shows
+no such association ($p = 0.586$).
+This confirms the intended division of labor---$z^{(L)}$
+captures enduring user identity, while $z^{(S)}$ absorbs
+transient within-session signals and decays between sessions.
+
+At the same time, the global Spearman correlation between
+Jaccard overlap and vector cosine similarity is modest
+($\rho = 0.040$, $p = 0.093$), indicating that the
+REINFORCE updates are effective directionally but noisy.
+With only $60$ sessions of weak feedback per user, the
+learned vectors capture coarse preference structure but
+lack the resolution to distinguish fine-grained preference
+differences.
+Longer interaction horizons and stronger reward signals
+may improve this resolution.
+
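+% Added sketch (assumption): the update form suggested by the text,
+% reusing $\eta_L$, $\eta_S$, and the decay $\lambda$ from the
+% hyperparameter list; the baseline $b$ and per-session direction
+% $g_t$ are illustrative, and the actual estimator may differ.
+The modest correlation is consistent with the form of the updates:
+a REINFORCE-style rule such as
+\begin{equation}
+z^{(L)}_u \leftarrow z^{(L)}_u + \eta_L (r_t - b)\, g_t,
+\qquad
+z^{(S)}_u \leftarrow \lambda z^{(S)}_u + \eta_S (r_t - b)\, g_t,
+\end{equation}
+receives only a single weak scalar reward $r_t$ per session, so
+$60$ sessions place a hard limit on how finely the vectors can
+resolve preference differences.
+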
+\paragraph{Failure modes and what personalization does not improve.}
+Three recurring failure modes limit all methods:
+(i) the agent sometimes becomes overly passive, asking
+clarification questions instead of solving the task;
+(ii) as preferences accumulate, the agent attempts to
+satisfy all simultaneously, producing verbose or unfocused
+responses; and
+(iii) the user simulator occasionally disengages early
+when initial responses are weak.
+The user vector partially mitigates~(ii)---RAG+Vector's
+timeout rate ($26.4\%$) is far lower than RAG's
+($44.3\%$)---but does not address~(i) or~(iii), which
+stem from the agent's generation strategy and the
+simulator's tolerance, respectively.
+Closing this gap likely requires tighter coupling between
+the reward design and the conversation strategy
+(e.g., explicitly rewarding decisive, preference-consistent
+solutions rather than neutral clarification turns), as well
+as memory management mechanisms that down-weight stale or
+redundant preferences.
+
+\paragraph{Limitations.}
+Several limitations scope the conclusions of this study.
+
+\emph{Scale and generalization.}
+We evaluate on $60$ profiles drawn from a pool of $200$,
+with $60$ sessions per profile.
+While this yields $3{,}600$ sessions per method, it remains
+a single benchmark with LLM-simulated users.
+Extending to real users, longer interaction horizons, and
+non-style preferences (e.g., topical interests, tool
+choices, safety constraints) is necessary to assess
+generalizability.
+
+\emph{Reward signal.}
+The current system uses a keyword-based heuristic to
+classify user follow-ups into reward categories.
+While fast and requiring no additional model, this
+heuristic may miss subtle feedback cues.
+We have implemented and validated an LLM-as-judge
+alternative (Llama-3.1-8B-Instruct, $83$--$92\%$
+agreement with GPT-based judges) that can serve as a
+drop-in replacement for richer reward estimation in future
+experiments.
+
+\emph{Baseline strength.}
+Our Reflection baseline uses prompted reflections without
+GRPO fine-tuning, which may underestimate the ceiling of
+reflection-based methods as reported in the original
+\textsc{MultiSessionCollab} paper.
+
+\emph{Hyperparameter sensitivity.}
+Learning rates ($\eta_L$, $\eta_S$), scoring weights
+($\beta_L$, $\beta_S$), decay rate ($\lambda$), and
+retrieval parameters (top-$k$, reranker threshold) are
+set heuristically.
+Systematic hyperparameter sweeps and ablation studies
+are needed to understand sensitivity.
+
+\paragraph{Future directions.}
+Three extensions are especially promising.
+First, \emph{component ablations}: we are conducting
+ablation experiments that disable either $z^{(S)}$
+(short-term vector) or $z^{(L)}$ (long-term vector) to
+quantify each component's contribution to task success and
+interaction efficiency.
+Second, \emph{stronger reward signals}: swapping in the
+LLM-based reward model and studying how reward quality
+affects user-vector learning dynamics.
+Third, \emph{scaling up}: extending to all $200$ profiles
+and longer session horizons to test whether a plug-in
+approach---RAG plus user modeling, without per-user
+fine-tuning---can sustain its advantages at larger scale
+and whether the user-vector geometry becomes more
+structured with more interaction data.
+```