# Revised LaTeX: Conclusion
## Changes from original
1. All numbers updated (54.3%→55.2%, not 57%→71%; 60 profiles not 5; 3 datasets not 1)
2. Token claim corrected: 16.9% reduction, not "halving"
3. User-vector correlation updated: modest ρ but significant quartile test
4. Scale description updated: 60 profiles from 200, 60 sessions
5. Future work updated: ablation underway, LLM judge ready, 200 profiles target
---
## Revised LaTeX
```latex
\section{Conclusion}
\label{sec:conclusion}
We have presented a frozen-backbone personalization framework
that combines structured preference extraction, a
retrieval-augmented preference memory, and online user vectors
updated via lightweight reinforcement learning.
Rather than fine-tuning the base LLM for each user, the
framework keeps all backbone models (chat, embedding, reranker)
fixed and introduces personalization only through an external
layer that represents user preferences as condition--action
rules and aggregates them into a low-dimensional user vector.
A key design choice is to treat feedback as a scalar reward,
making the user-vector update agnostic to the specific feedback
channel and, in principle, compatible with both explicit and
implicit signals.
We instantiated this framework on the
\textsc{MultiSessionCollab} benchmark across three task domains
(math-hard, math-500, bigcodebench) with an LLM-based user
simulator and style-oriented user profiles.
Under a frozen 8B backbone evaluated over $60$ profiles and
$60$ sessions ($3{,}600$ sessions per method), our
\textbf{RAG+Vector} method achieves the highest task success
rate ($55.2\%$) among all six system modes, significantly
reduces timeout rate ($26.4\%$ vs.\ $28.8\%$, $p = 0.046$)
and user effort ($193.6$ vs.\ $207.5$ tokens, $p = 0.021$)
compared to the Reflection baseline, and attains the highest
interaction efficiency ($2.83$ successes per $1{,}000$ user
tokens).
Analysis of the learned user vectors confirms that the
dual-vector design separates stable user identity from
session-specific context: long-term vectors $z^{(L)}$
significantly associate with preference overlap across users
($p = 0.006$), while short-term vectors $z^{(S)}$ do not
($p = 0.586$), consistent with the intended division of labor.
Our study is limited in scope: it evaluates $60$ profiles
drawn from a pool of $200$, focuses on style preferences, and
compares against a prompted Reflection baseline rather than
the GRPO-trained agents from the original
\textsc{MultiSessionCollab} paper.
Several extensions are underway or planned: component
ablations that isolate the contribution of each user-vector
component, deployment of an LLM-based reward model to
replace the current keyword heuristic, and scaling
experiments to all $200$ profiles.
We view this work as a case study demonstrating that
lightweight user modeling on top of RAG---without any
per-user fine-tuning of the backbone---can yield measurable
efficiency gains in online, multi-session personalization,
and as a starting point for larger-scale investigations of
long-term adaptation with frozen LLM backbones.
```