# Revised LaTeX: Discussion & Limitations

## Changes from original

1. All numbers updated to match 60 profiles × 60 sessions × 3 datasets
2. Removed duplicate "Beyond the aggregate table" paragraph
3. Rewrote Contextual vs RAG analysis (old narrative no longer holds)
4. Updated user-vector analysis (ρ=0.040, Q1/Q5 test, dual-vector separation)
5. Removed old variance analysis (three runs on 5 profiles)
6. Added E/T decomposition discussion
7. Added RAG timeout analysis (44.3% → 26.4% with vector)
8. Updated Limitations (60 profiles, 60 sessions, keyword reward, style-only)
9. Updated future directions (ablation studies underway, LLM judge available)

---

## Revised LaTeX

```latex
\section{Discussion \& Limitations}
\label{sec:discussion}

\paragraph{Summary of findings.}
Our experiments show that user modeling matters for multi-session personalization, but the benefits are nuanced. On the three-domain evaluation ($60$~profiles, $60$~sessions), RAG+Vector achieves the highest task success rate ($55.2\%$), but the absolute gain over Reflection ($54.4\%$) is small and not statistically significant ($p = 0.276$). The more consistent improvements appear on interaction-efficiency metrics: RAG+Vector significantly reduces the timeout rate ($-2.4$~pp, $p = 0.046$) and user effort ($-13.9$ tokens, $p = 0.021$) compared to Reflection, while yielding the highest success-per-token efficiency ($2.83$ per $1{,}000$ user tokens, $+8.4\%$ vs.\ Reflection). These patterns suggest that a lightweight, frozen-backbone user model primarily improves \emph{how efficiently} the agent interacts rather than \emph{how often} it ultimately succeeds.

\paragraph{The role of the user vector.}
Comparing RAG ($52.0\%$) with RAG+Vector ($55.2\%$) isolates the contribution of the learned user vector, since the two methods share identical preference extraction, embedding, and reranking components.
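Schematically, the user vector enters retrieval as an additive bias on the reranker score; a minimal sketch (using the scoring weights $\beta_L$, $\beta_S$ from our hyperparameter list, with $e_p$ denoting the embedding of preference card $p$ and $q$ the current query embedding) is
\[
s(p \mid u) \;=\; \operatorname{sim}(q, e_p) \;+\; \beta_L \cos\big(z^{(L)}_u, e_p\big) \;+\; \beta_S \cos\big(z^{(S)}_u, e_p\big),
\]
which we present as an illustration of the mechanism rather than the exact implemented form.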
The $+3.2$~pp success gain is accompanied by a dramatic reduction in timeout rate ($44.3\% \to 26.4\%$), suggesting that the user vector helps the retrieval policy prioritize the most relevant preferences and avoid flooding the agent with tangential or conflicting memory cards. Without the vector, the RAG method suffers from \emph{preference overload}: as the memory grows over sessions, the retrieval system surfaces an increasingly noisy set of preferences, which leads the agent to either hedge (by asking clarification questions) or produce unfocused responses. The user vector mitigates this by biasing retrieval toward preferences that have historically led to positive user feedback for this particular user.

\paragraph{Comparison with Reflection.}
The Reflection baseline generates a reasoning-heavy session-level summary after each conversation and appends it to future prompts, providing an implicit form of cross-session memory. RAG+Vector matches Reflection in task success while using $6.7\%$ fewer user tokens per session. This efficiency advantage is practically meaningful in deployment scenarios where user patience and token budgets are limited. Under a frozen backbone, a relatively cheap combination of preference memory and user vectors can match the performance of a reflection-style agent that incurs an extra LLM pass per session. In our implementation, the Reflection baseline uses prompted reflections without GRPO fine-tuning; reproducing the GRPO-trained reflection agents from the original \textsc{MultiSessionCollab} paper remains future work and would likely raise the Reflection ceiling further.

\paragraph{Preference format and agent compliance.}
An important factor in RAG-based methods is the \emph{format} in which retrieved preferences are injected into the agent prompt. Our preference extractor stores preferences as structured condition--action rules (e.g., \texttt{"condition": "math problems", "action": "show step-by-step work"}).
We find that directly injecting these structured rules into the prompt of an 8B-parameter agent often fails to elicit compliance, even when the correct preferences are retrieved: the agent appears to treat the structured format as metadata rather than actionable instructions. By contrast, Reflection's session-level summaries are expressed in natural language and are more readily followed by the agent. This representational mismatch is compounded by the fact that we adopt the agent prompt template from \textsc{MultiSessionCollab}, which was specifically designed and tuned for Reflection-style notes rather than structured preference cards, further disadvantaging RAG-based methods. To address this, we introduce a lightweight \emph{rewrite} module that uses the agent LLM to merge the top-$k$ retrieved preference cards into one or two fluent natural-language instructions before prompt injection. In a variant experiment (RAG+Rewrite), this step improves task success by $+0.8$~pp over plain RAG ($51.6\% \to 52.4\%$) and reduces timeout rate by $1.4$~pp ($25.8\% \to 24.4\%$), closing roughly half of the gap between RAG and Reflection ($53.3\%$ in this variant's run). These results suggest that preference compliance is bottlenecked not only by retrieval quality but also by how preferences are \emph{presented} to the agent, and that bridging this format gap is a promising direction for further improvement.

\paragraph{Why does Vanilla perform well?}
A perhaps surprising finding is that Vanilla ($54.3\%$) nearly matches Reflection ($54.4\%$), despite having no cross-session memory at all. This reflects the difficulty of the evaluation setting: with three diverse task domains and $43$ style preferences per profile, even a memory-augmented agent cannot reliably satisfy all preferences, and the user simulator's preference enforcement behavior introduces substantial variance.
The result also suggests that a meaningful portion of task success is driven by the agent's intrinsic problem-solving ability rather than preference compliance, and that current personalization methods yield incremental rather than transformative gains on this benchmark.

\paragraph{Enforcement rate and interaction dynamics.}
RAG+Vector exhibits a slightly higher enforcement-per-turn rate ($0.186$) than Reflection ($0.175$), which might initially appear counterproductive. However, decomposing this metric as $E/T$ (total enforcements per session, $E$, over turns per session, $T$) reveals that $79\%$ of the gap comes from a marginal increase in total enforcements per session ($1.54$ vs.\ $1.47$, $p = 0.058$), while RAG+Vector simultaneously completes sessions in fewer turns ($8.31$ vs.\ $8.41$). Since RAG+Vector achieves comparable success with fewer turns and significantly lower user token cost, the slightly elevated enforcement rate does not translate into worse interaction quality. We hypothesize that RAG's explicit preference injection prompts the user simulator to give more targeted, preference-specific feedback, whereas Reflection's summarized notes lead to broader but less pointed enforcement signals.

\paragraph{User-vector geometry.}
Our analysis of the learned long-term vectors $z^{(L)}_u$ reveals that they encode preference-aware structure rather than random noise. Users with more similar revealed preferences develop more similar long-term vectors: the top-$25\%$ of user pairs by preference Jaccard overlap have a mean vector cosine of $+0.012$, versus $-0.027$ for the bottom-$25\%$ ($p = 0.021$, Mann--Whitney $U$). The dual-vector design successfully separates stable preference identity from session-specific context: $z^{(L)}$ significantly associates with long-term preference overlap ($p = 0.006$), while $z^{(S)}$ shows no such association ($p = 0.586$). This confirms the intended division of labor---$z^{(L)}$ captures enduring user identity, while $z^{(S)}$ absorbs transient within-session signals and decays between sessions.
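Schematically, the intended dynamics can be sketched (illustratively, not as the exact implemented update) as
\[
z^{(S)}_u \,\leftarrow\, \lambda\, z^{(S)}_u \quad \text{(decay between sessions)}, \qquad
z^{(L)}_u \,\leftarrow\, z^{(L)}_u + \eta_L\, r\, \nabla_{z^{(L)}_u} \log \pi\big(a \mid z^{(L)}_u, z^{(S)}_u\big) \quad \text{(REINFORCE)},
\]
with decay rate $\lambda$ and learning rate $\eta_L$ as in our hyperparameter list, $r$ the per-session reward, and $\pi$ the vector-conditioned policy.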
At the same time, the global Spearman correlation between Jaccard overlap and vector cosine similarity is modest ($\rho = 0.040$, $p = 0.093$), indicating that the REINFORCE updates are directionally consistent but noisy. With only $60$ sessions of weak feedback per user, the learned vectors capture coarse preference structure but lack the resolution to distinguish fine-grained preference differences. Longer interaction horizons and stronger reward signals may improve this resolution.

\paragraph{Failure modes and what personalization does not improve.}
Three recurring failure modes limit all methods: (i) the agent sometimes becomes overly passive, asking clarification questions instead of solving the task; (ii) as preferences accumulate, the agent attempts to satisfy all of them simultaneously, producing verbose or unfocused responses; and (iii) the user simulator occasionally disengages early when initial responses are weak. The user vector partially mitigates~(ii)---RAG+Vector's timeout rate ($26.4\%$) is far lower than RAG's ($44.3\%$)---but does not address~(i) or~(iii), which stem from the agent's generation strategy and the simulator's tolerance, respectively. Closing this gap likely requires tighter coupling between the reward design and the conversation strategy (e.g., explicitly rewarding decisive, preference-consistent solutions rather than neutral clarification turns), as well as memory management mechanisms that down-weight stale or redundant preferences.

\paragraph{Limitations.}
Several limitations scope the conclusions of this study.
\emph{Scale and generalization.} We evaluate on $60$ profiles drawn from a pool of $200$, with $60$ sessions per profile. While this yields $3{,}600$ sessions per method, it remains a single benchmark with LLM-simulated users. Extending to real users, longer interaction horizons, and non-style preferences (e.g., topical interests, tool choices, safety constraints) is necessary to assess generalizability.
\emph{Reward signal.} The current system uses a keyword-based heuristic to classify user follow-ups into reward categories. While fast and requiring no additional model, this heuristic may miss subtle feedback cues. We have implemented and validated an LLM-as-judge alternative (Llama-3.1-8B-Instruct, $83$--$92\%$ agreement with GPT-based judges) that can serve as a drop-in replacement for richer reward estimation in future experiments.
\emph{Baseline strength.} Our Reflection baseline uses prompted reflections without GRPO fine-tuning, which may underestimate the ceiling of reflection-based methods as reported in the original \textsc{MultiSessionCollab} paper.
\emph{Hyperparameter sensitivity.} Learning rates ($\eta_L$, $\eta_S$), scoring weights ($\beta_L$, $\beta_S$), decay rate ($\lambda$), and retrieval parameters (top-$k$, reranker threshold) are set heuristically. Systematic hyperparameter sweeps and ablation studies are needed to understand sensitivity.

\paragraph{Future directions.}
Three extensions are especially promising. First, \emph{component ablations}: we are conducting ablation experiments that disable either $z^{(S)}$ (short-term vector) or $z^{(L)}$ (long-term vector) to quantify each component's contribution to task success and interaction efficiency. Second, \emph{stronger reward signals}: swapping in the LLM-based reward model and studying how reward quality affects user-vector learning dynamics. Third, \emph{scaling up}: extending to all $200$ profiles and longer session horizons to test whether a plug-in approach---RAG plus user modeling, without per-user fine-tuning---can sustain its advantages at larger scale and whether the user-vector geometry becomes more structured with more interaction data.
```
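---

## Companion analysis sketch (Python)

As a companion to the user-vector geometry paragraph, the following Python sketch shows how the pairwise analysis can be computed. Function names and the data layout are illustrative, not taken from our codebase; only the statistical tests (a global Spearman correlation over all user pairs, and a Mann-Whitney U contrast between the top and bottom overlap quartiles) mirror what the section reports.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr


def jaccard(a, b):
    """Jaccard overlap between two preference sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def pairwise_geometry(pref_sets, vectors):
    """For every user pair, return (preference Jaccard overlap, vector cosine)."""
    overlaps, cosines = [], []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            overlaps.append(jaccard(pref_sets[i], pref_sets[j]))
            vi, vj = np.asarray(vectors[i]), np.asarray(vectors[j])
            cosines.append(float(vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))))
    return np.array(overlaps), np.array(cosines)


def geometry_tests(overlaps, cosines):
    """Global Spearman rho plus a top-vs-bottom overlap-quartile cosine contrast."""
    rho, p_global = spearmanr(overlaps, cosines)
    q1, q3 = np.quantile(overlaps, [0.25, 0.75])
    top = cosines[overlaps >= q3]  # high-overlap user pairs
    bot = cosines[overlaps <= q1]  # low-overlap user pairs
    _, p_quartile = mannwhitneyu(top, bot, alternative="greater")
    return rho, p_global, float(top.mean()), float(bot.mean()), p_quartile
```

On synthetic data where users with overlapping preferences also have aligned vectors, `geometry_tests` returns a positive rho and a higher mean cosine for the top-overlap quartile, matching the qualitative pattern reported above.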