diff options
| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 02:59:00 +0000 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-11 02:59:00 +0000 |
| commit | 8e9248a361b0b3fb4f3dbd37a14b1028afeb3015 (patch) | |
| tree | 763be3c2272308b8400d411b54b4e18cb893e0c0 /docs | |
| parent | 7618450230f8938b86a8b3521c855add99f74ff2 (diff) | |
Add revised discussion & limitations LaTeX section
Complete rewrite with current data (60 profiles × 60 sessions):
- Updated all numbers and removed stale references
- Removed duplicate paragraph
- Added: user vector role analysis (RAG 44.3% → RAG+Vec 26.4% timeout)
- Added: E/T decomposition (79% from enforcements, not negative)
- Added: why Vanilla performs well discussion
- Updated: user-vector geometry (ρ=0.040, dual-vector separation)
- Updated: limitations (keyword reward, no GRPO, 60 profiles)
- Updated: future directions (ablation underway, LLM judge ready)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat (limited to 'docs')
| -rw-r--r-- | docs/discussion_revised.md | 222 |
1 files changed, 222 insertions, 0 deletions
diff --git a/docs/discussion_revised.md b/docs/discussion_revised.md
new file mode 100644
index 0000000..5d02dc6
--- /dev/null
+++ b/docs/discussion_revised.md
@@ -0,0 +1,222 @@

# Revised LaTeX: Discussion & Limitations

## Changes from original

1. All numbers updated to match 60 profiles × 60 sessions × 3 datasets
2. Removed duplicate "Beyond the aggregate table" paragraph
3. Rewrote Contextual vs RAG analysis (old narrative no longer holds)
4. Updated user-vector analysis (ρ=0.040, Q1/Q5 test, dual-vector separation)
5. Removed old variance analysis (three runs on 5 profiles)
6. Added E/T decomposition discussion
7. Added RAG timeout analysis (44.3% → 26.4% with vector)
8. Updated Limitations (60 profiles, 60 sessions, keyword reward, style-only)
9. Updated future directions (ablation studies underway, LLM judge available)

---

## Revised LaTeX

```latex
\section{Discussion \& Limitations}
\label{sec:discussion}

\paragraph{Summary of findings.}
Our experiments show that user modeling matters for
multi-session personalization, but the benefits are
nuanced.
On the three-domain evaluation ($60$~profiles,
$60$~sessions), RAG+Vector achieves the highest task
success rate ($55.2\%$), but the absolute gain over
Reflection ($54.4\%$) is small and not statistically
significant ($p = 0.276$).
The more consistent improvements appear on
interaction-efficiency metrics: RAG+Vector significantly
reduces timeout rate ($-2.4$~pp, $p = 0.046$) and user
effort ($-13.9$ tokens, $p = 0.021$) compared to
Reflection, while yielding the highest success-per-token
efficiency ($2.83$ per $1{,}000$ user tokens, $+8.4\%$
vs.\ Reflection).
These patterns suggest that a lightweight, frozen-backbone
user model primarily improves \emph{how efficiently} the
agent interacts rather than \emph{how often} it ultimately
succeeds.
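% Added for exposition (one natural reading of the metric as reported;
% the normalization matches "per 1,000 user tokens").
Concretely, success-per-token efficiency normalizes success
count by user effort in thousands of tokens,
\[
  \mathrm{Eff} \;=\;
  \frac{\#\{\text{successful sessions}\}}
       {\bigl(\sum_{s}\text{user tokens in session } s\bigr)/1000},
\]
so gains can come either from succeeding more often or from
consuming fewer user tokens.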

\paragraph{The role of the user vector.}
Comparing RAG ($52.0\%$) with RAG+Vector ($55.2\%$)
isolates the contribution of the learned user vector,
since the two methods share identical preference extraction,
embedding, and reranking components.
The $+3.2$~pp success gain is accompanied by a dramatic
reduction in timeout rate ($44.3\% \to 26.4\%$), suggesting
that the user vector helps the retrieval policy prioritize
the most relevant preferences and avoid flooding the agent
with tangential or conflicting memory cards.
Without the vector, the RAG method suffers from
\emph{preference overload}: as the memory grows over
sessions, the retrieval system surfaces an increasingly
noisy set of preferences, which leads the agent to either
hedge (by asking clarification questions) or produce
unfocused responses.
The user vector mitigates this by biasing retrieval toward
preferences that have historically led to positive user
feedback for this particular user.

\paragraph{Comparison with Reflection.}
The Reflection baseline generates a reasoning-heavy
session-level summary after each conversation and appends
it to future prompts, providing an implicit form of
cross-session memory.
RAG+Vector matches Reflection in task success while using
$6.7\%$ fewer user tokens per session.
This efficiency advantage is practically meaningful in
deployment scenarios where user patience and token budgets
are limited.
Under a frozen backbone, a relatively cheap combination of
preference memory and user vectors can match the performance
of a reflection-style agent that incurs an extra LLM pass
per session.
In our implementation, the Reflection baseline uses prompted
reflections without GRPO fine-tuning; reproducing the
GRPO-trained reflection agents from the original
\textsc{MultiSessionCollab} paper remains future work and
would likely raise the Reflection ceiling further.
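% Added for exposition: schematic form of the vector-biased retrieval
% score. The exact scoring rule is implementation-specific; $e_c$ and
% $e_q$ denote card and query embeddings, and $\beta_L$, $\beta_S$ are
% the scoring weights discussed under hyperparameter sensitivity.
Schematically, the vector-biased retrieval described above
scores each memory card $c$ as
\[
  s(c) \;=\; \cos(e_c, e_q)
        \;+\; \beta_L \cos\bigl(e_c, z^{(L)}_u\bigr)
        \;+\; \beta_S \cos\bigl(e_c, z^{(S)}_u\bigr),
\]
with the top-$k$ cards under $s(c)$ passed to the reranker.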

\paragraph{Why does Vanilla perform well?}
A perhaps surprising finding is that Vanilla ($54.3\%$)
nearly matches Reflection ($54.4\%$), despite having no
cross-session memory at all.
This reflects the difficulty of the evaluation setting:
with three diverse task domains and $43$ style preferences
per profile, even a memory-augmented agent cannot reliably
satisfy all preferences, and the user simulator's preference
enforcement behavior introduces substantial variance.
The result also suggests that a meaningful portion of task
success is driven by the agent's intrinsic problem-solving
ability rather than preference compliance, and that current
personalization methods yield incremental rather than
transformative gains on this benchmark.

\paragraph{Enforcement rate and interaction dynamics.}
RAG+Vector exhibits a slightly higher enforcement-per-turn
rate ($0.186$) than Reflection ($0.175$), which might
initially appear counterproductive.
However, decomposing this metric reveals that $79\%$ of the
gap comes from a marginal increase in total enforcements per
session ($1.54$ vs.\ $1.47$, $p = 0.058$), while RAG+Vector
simultaneously completes sessions in fewer turns ($8.31$ vs.\
$8.41$).
To first order,
$\Delta(E/T) \approx \Delta E/\bar{T} - (\bar{E}/\bar{T}^{2})\,\Delta T$,
so the enforcement-count term supplies the reported $79\%$
and the shorter sessions (negative $\Delta T$) the remainder.
Since RAG+Vector achieves comparable success with fewer
turns and significantly lower user token cost, the slightly
elevated enforcement rate does not translate into worse
interaction quality.
We hypothesize that RAG's explicit preference injection
prompts the user simulator to give more targeted,
preference-specific feedback, whereas Reflection's
summarized notes lead to broader but less pointed
enforcement signals.

\paragraph{User-vector geometry.}
Our analysis of the learned long-term vectors $z^{(L)}_u$
reveals that they encode preference-aware structure rather
than random noise.
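% Notation introduced here for exposition: $P_u$ denotes the set of
% preferences user $u$ has revealed.
Concretely, each user pair $(u, v)$ contributes the point
\[
  \bigl( J(u,v),\, \cos(u,v) \bigr)
  \;=\;
  \Bigl( \tfrac{|P_u \cap P_v|}{|P_u \cup P_v|},\;
         \tfrac{\langle z^{(L)}_u, z^{(L)}_v \rangle}
               {\|z^{(L)}_u\|\,\|z^{(L)}_v\|} \Bigr),
\]
where $P_u$ is the set of preferences user $u$ has revealed.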
Users with more similar revealed preferences develop more
similar long-term vectors: the top-$25\%$ of user pairs by
preference Jaccard overlap have a mean vector cosine of
$+0.012$, versus $-0.027$ for the bottom-$25\%$
($p = 0.021$, Mann--Whitney $U$).
The dual-vector design successfully separates stable
preference identity from session-specific context:
$z^{(L)}$ significantly associates with long-term
preference overlap ($p = 0.006$), while $z^{(S)}$ shows
no such association ($p = 0.586$).
This confirms the intended division of labor---$z^{(L)}$
captures enduring user identity, while $z^{(S)}$ absorbs
transient within-session signals and decays between sessions.

At the same time, the global Spearman correlation between
Jaccard overlap and vector cosine similarity is modest
($\rho = 0.040$, $p = 0.093$), indicating that the
REINFORCE updates are effective directionally but noisy.
With only $60$ sessions of weak feedback per user, the
learned vectors capture coarse preference structure but
lack the resolution to distinguish fine-grained preference
differences.
Longer interaction horizons and stronger reward signals
may improve this resolution.

\paragraph{Failure modes and what personalization does not improve.}
Three recurring failure modes limit all methods:
(i) the agent sometimes becomes overly passive, asking
clarification questions instead of solving the task;
(ii) as preferences accumulate, the agent attempts to
satisfy all simultaneously, producing verbose or unfocused
responses; and
(iii) the user simulator occasionally disengages early
when initial responses are weak.
The user vector partially mitigates~(ii)---RAG+Vector's
timeout rate ($26.4\%$) is far lower than RAG's
($44.3\%$)---but does not address~(i) or~(iii), which
stem from the agent's generation strategy and the
simulator's tolerance, respectively.
Closing this gap likely requires tighter coupling between
the reward design and the conversation strategy
(e.g., explicitly rewarding decisive, preference-consistent
solutions rather than neutral clarification turns), as well
as memory management mechanisms that down-weight stale or
redundant preferences.

\paragraph{Limitations.}
Several limitations scope the conclusions of this study.

\emph{Scale and generalization.}
We evaluate on $60$ profiles drawn from a pool of $200$,
with $60$ sessions per profile.
While this yields $3{,}600$ sessions per method, it remains
a single benchmark with LLM-simulated users.
Extending to real users, longer interaction horizons, and
non-style preferences (e.g., topical interests, tool
choices, safety constraints) is necessary to assess
generalizability.

\emph{Reward signal.}
The current system uses a keyword-based heuristic to
classify user follow-ups into reward categories.
While fast and requiring no additional model, this
heuristic may miss subtle feedback cues.
We have implemented and validated an LLM-as-judge
alternative (Llama-3.1-8B-Instruct, $83$--$92\%$
agreement with GPT-based judges) that can serve as a
drop-in replacement for richer reward estimation in future
experiments.

\emph{Baseline strength.}
Our Reflection baseline uses prompted reflections without
GRPO fine-tuning, which may underestimate the ceiling of
reflection-based methods as reported in the original
\textsc{MultiSessionCollab} paper.

\emph{Hyperparameter sensitivity.}
Learning rates ($\eta_L$, $\eta_S$), scoring weights
($\beta_L$, $\beta_S$), decay rate ($\lambda$), and
retrieval parameters (top-$k$, reranker threshold) are
set heuristically.
Systematic hyperparameter sweeps and ablation studies
are needed to understand sensitivity.

\paragraph{Future directions.}
Three extensions are especially promising.
First, \emph{component ablations}: we are conducting
ablation experiments that disable either $z^{(S)}$
(short-term vector) or $z^{(L)}$ (long-term vector) to
quantify each component's contribution to task success and
interaction efficiency.
Second, \emph{stronger reward signals}: swapping in the
LLM-based reward model and studying how reward quality
affects user-vector learning dynamics.
Third, \emph{scaling up}: extending to all $200$ profiles
and longer session horizons to test whether a plug-in
approach---RAG plus user modeling, without per-user
fine-tuning---can sustain its advantages at larger scale
and whether the user-vector geometry becomes more
structured with more interaction data.
```
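## Illustration: quartile analysis sketch

For concreteness, the quartile comparison behind the user-vector geometry paragraph can be sketched as below. This is a self-contained illustration on synthetic data: `profiles` and `vectors` stand in for the real revealed-preference sets and learned z^(L) vectors, and the Mann–Whitney significance test is omitted.

```python
from itertools import combinations
import math
import random

def jaccard(a, b):
    """Jaccard overlap |a ∩ b| / |a ∪ b| between two preference sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

# Synthetic stand-ins: 60 users, 43-item preference pool, 16-d vectors.
random.seed(0)
profiles = {u: set(random.sample(range(43), 12)) for u in range(60)}
vectors = {u: [random.gauss(0.0, 1.0) for _ in range(16)] for u in range(60)}

# One (Jaccard, cosine) point per user pair, sorted by preference overlap.
pairs = [(jaccard(profiles[a], profiles[b]), cosine(vectors[a], vectors[b]))
         for a, b in combinations(range(60), 2)]
pairs.sort(key=lambda p: p[0])

q = len(pairs) // 4
bottom = [c for _, c in pairs[:q]]   # bottom-25% of pairs by Jaccard
top = [c for _, c in pairs[-q:]]     # top-25% of pairs by Jaccard
gap = sum(top) / q - sum(bottom) / q
print(f"{len(pairs)} pairs; top-vs-bottom quartile mean-cosine gap = {gap:+.4f}")
```

On the real vectors, a positive gap between the top- and bottom-quartile mean cosines corresponds to the reported +0.012 vs -0.027 separation; on this random synthetic data the gap is expected to be near zero.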
