Diffstat (limited to 'paper')
-rw-r--r--   paper/uph_paper.tex   302
1 file changed, 302 insertions, 0 deletions
diff --git a/paper/uph_paper.tex b/paper/uph_paper.tex
new file mode 100644
index 0000000..e3e4208
--- /dev/null
+++ b/paper/uph_paper.tex
@@ -0,0 +1,302 @@
% UPH: A Single-Vector User Prior for Memory-Free Personalization of Frozen LLMs
% ACM UMAP 2026 Late-Breaking Results

\documentclass[sigconf,authordraft]{acmart}

\AtBeginDocument{%
  \providecommand\BibTeX{{%
    Bib\TeX}}}

\setcopyright{acmlicensed}
\copyrightyear{2026}
\acmYear{2026}
\acmDOI{XXXXXXX.XXXXXXX}
\acmConference[UMAP '26]{Proceedings of the 34th ACM Conference on User Modeling, Adaptation and Personalization}{June 23--27, 2026}{Gothenburg, Sweden}
\acmISBN{978-1-4503-XXXX-X/2026/06}

\usepackage{amsmath}
\usepackage{tabularx}
\usepackage{graphicx}
\newcolumntype{C}{>{\centering\arraybackslash}X}

\begin{document}

\title{A Single-Vector User Prior for Extreme Few-Shot Memory-Free Personalization of Frozen LLMs}

\author{Yuren Hao}
\affiliation{%
  \institution{University of Illinois Urbana-Champaign}
  \city{Urbana}
  \state{Illinois}
  \country{USA}}
\email{yurenha2@illinois.edu}

\author{ChengXiang Zhai}
\orcid{0000-0002-6434-3702}
\affiliation{%
  \institution{University of Illinois Urbana-Champaign}
  \city{Urbana}
  \state{Illinois}
  \country{USA}}
\email{czhai@illinois.edu}

\renewcommand{\shortauthors}{Hao and Zhai}

\begin{abstract}
Personalizing large language model (LLM) generation typically relies on injecting user history into the prompt at inference time, incurring context cost and potential brittleness when support exemplars poorly match the current query.
We ask whether user history can instead be compressed into a single vector that acts as a generation prior on a frozen LLM, without any inference-time textual user memory.
We propose the \emph{User Prior Head} (UPH), which adds a static user-specific bias to the model's final hidden states: $\mathbf{h}'_t = \mathbf{h}_t + \mathbf{U}\boldsymbol{\theta}_u$, where $\boldsymbol{\theta}_u$ is a 64-dimensional vector (128 bytes in bfloat16) fitted from each user's support history in ${\sim}7$~seconds on a single GPU.
On LongLaMP Review Writing and Topic Writing with $K{=}4$ support examples per user, UPH improves ROUGE-L by ${\sim}11\%$ on average over the unpersonalized baseline while using zero personalized prompt tokens at inference.
On Topic Writing---where support--query mismatch is high---UPH is the best method overall, significantly outperforming retrieval-based textual baselines (Dense-Top1, BM25-Top1) and parameter-efficient fine-tuning methods (LoRA, VeRA, Prompt/Prefix Tuning), all of which either require hundreds of prompt tokens or orders of magnitude more per-user state.
An ablation over $K \in \{4, 8, 16\}$ and user-vector dimension $d \in \{8, 16, 32, 64, 128\}$ confirms that the effect is stable and that a 16-byte user prior ($d{=}8$) already recovers most of the gain.
Our results suggest that a compact user prior can serve as an effective alternative to inference-time textual personalization in many settings.
\end{abstract}

\begin{CCSXML}
<ccs2012>
  <concept>
    <concept_id>10002951.10003317.10003347.10003350</concept_id>
    <concept_desc>Information systems~Recommender systems</concept_desc>
    <concept_significance>500</concept_significance>
  </concept>
  <concept>
    <concept_id>10010147.10010178.10010179</concept_id>
    <concept_desc>Computing methodologies~Natural language generation</concept_desc>
    <concept_significance>500</concept_significance>
  </concept>
  <concept>
    <concept_id>10002951.10003317.10003338</concept_id>
    <concept_desc>Information systems~Users and interactive retrieval</concept_desc>
    <concept_significance>300</concept_significance>
  </concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Information systems~Recommender systems}
\ccsdesc[500]{Computing methodologies~Natural language generation}
\ccsdesc[300]{Information systems~Users and interactive retrieval}

\keywords{personalization, user modeling, language models, user prior, parameter-efficient adaptation}

\maketitle

%% Figure 1 defined early so it floats to top of page 2
\begin{figure*}[t]
  \centering
  \includegraphics[width=\textwidth]{uph.drawio (1).png}
  \caption{Four personalization paradigms for frozen or adapted language models. Text-memory methods carry user history as inference-time prompts; PEFT methods personalize via per-user trainable adapters; UPH compresses user history into a single per-user vector prior $\boldsymbol{\theta}_u$ that additively biases the frozen model's final hidden states, enabling memory-free personalization with zero personalized prompt tokens.}
  \Description{Four neural network diagrams comparing: no personalization with frozen backbone, memory-based with user history as prompts, PEFT-based with unfrozen backbone, and UPH-based with frozen backbone and single vector prior.}
  \label{fig:paradigms}
\end{figure*}

%% ============================================================
%% SECTION 1: Introduction
%% ============================================================

\section{Introduction}

Personalizing LLM generation typically relies on user history being present as text in the prompt at inference time.
A common strategy concatenates a user's past writings directly into the prompt, while retrieval-based variants first select the most relevant exemplar via BM25 or dense similarity.
While effective when good exemplars are available, this paradigm has two systemic issues.
First, it incurs \emph{context cost}: each personalized query must carry hundreds of additional prompt tokens.
Second, it can be \emph{brittle}: when the user's history poorly matches the current query---for example, when a reviewer of cookbooks is asked to write about an unfamiliar electronics product, or a Reddit user who writes about sports is asked to write about a novel topic---the injected exemplars may provide limited benefit or even degrade output quality.

Parameter-efficient fine-tuning (PEFT) methods such as LoRA, VeRA, Prompt Tuning, and Prefix Tuning offer an alternative in which each user's state is a small set of learned model-side parameters.
At the extreme few-shot scale we target ($K{=}4$ writing samples per user), however, these methods either overfit or fail to learn anything at all: even rank-1 LoRA has $>10^5$ parameters per user, and Prompt/Prefix Tuning over a 1.5B-parameter backbone produce near-random outputs (Table~\ref{tab:main}).

We ask a narrower question: \emph{can user history be compressed into a single vector that acts as a generation prior on a frozen LLM, without any inference-time textual user memory?}
If so, personalization reduces to storing one small vector per user and applying it at decode time, with no retrieval, no prompt augmentation, and no per-user model fine-tuning.

Concretely, we propose the \textbf{User Prior Head (UPH)}, which modifies the final hidden states of a frozen LLM via a static, user-specific additive bias:
\begin{equation}
  \mathbf{h}'_t = \mathbf{h}_t + \alpha\, \mathbf{U}\boldsymbol{\theta}_u,
  \label{eq:uph}
\end{equation}
where $\mathbf{U} \in \mathbb{R}^{H \times d}$ is a shared projection matrix (fixed after random initialization), $\boldsymbol{\theta}_u \in \mathbb{R}^{d}$ is a per-user vector fitted from that user's support history, $H$ is the hidden dimension ($1536$ for Qwen2.5-1.5B), $d{=}64$ by default, and $\alpha{=}0.1$ is a scaling constant.

\paragraph{Fitting $\boldsymbol{\theta}_u$.}
For each user $u$, we optimize $\boldsymbol{\theta}_u$ to minimize the token-level cross-entropy loss on the user's $K$ support examples, regularized by a KL-divergence penalty toward the base model distribution and an $\ell_2$ penalty on $\boldsymbol{\theta}_u$.
Optimization requires only ${\sim}30$ Adam steps per user (${\sim}7$~seconds on a single GPU) and touches no LLM parameters.
The result is a compact, offline-precomputed user state that can be loaded at serving time with negligible overhead.
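Schematically, with $\mathcal{S}_u$ denoting user $u$'s support examples, $p_{\boldsymbol{\theta}_u}$ the UPH-biased next-token distribution, and $p_{\mathrm{base}}$ the unmodified base distribution (the KL is written from the biased toward the base distribution), the per-user objective takes the form
\begin{equation}
  \mathcal{L}(\boldsymbol{\theta}_u)
  = \sum_{(x, y) \in \mathcal{S}_u}
    \Bigl[
      \mathrm{CE}\bigl(y, p_{\boldsymbol{\theta}_u}(\cdot \mid x)\bigr)
      + \beta\, \mathrm{KL}\bigl(p_{\boldsymbol{\theta}_u}(\cdot \mid x) \,\big\Vert\, p_{\mathrm{base}}(\cdot \mid x)\bigr)
    \Bigr]
    + \lambda \lVert \boldsymbol{\theta}_u \rVert_2^2,
  \label{eq:fit}
\end{equation}
with KL weight $\beta$ and $\ell_2$ weight $\lambda$ set as in the UPH configuration of the experimental setup.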

\paragraph{Contributions.}
Our contribution is not a more expressive personalization module, but the empirical and systems claim that personalization can be reduced to a tiny user prior.
\begin{enumerate}
  \item We formulate \emph{memory-free personalization} as learning a single-vector user prior on top of a frozen generator, and study the operating point where per-user state is $O(d)$, inference uses zero personalized prompt tokens, and fitting requires only cached hidden states.
  \item Across both LongLaMP Review and Topic Writing tasks at $K{=}4$ support examples, UPH significantly improves ROUGE-L over the unpersonalized base model ($p < 10^{-12}$, paired $t$-test) while matching or outperforming every personalization baseline we tested, including ICL retrieval (Prompt-All-K, BM25, Dense), profile-based summarization, and five PEFT methods (LoRA, Tiny LoRA, VeRA, Prompt Tuning, Prefix Tuning).
  \item An ablation over $K{\in}\{4,8,16\}$ and $d{\in}\{8,16,32,64,128\}$ shows that the effect is stable across support set sizes and that $d{=}8$ (a 16-byte user prior) already recovers most of the ROUGE-L gain, revealing how compact a useful user prior can be.
\end{enumerate}

%% Main table
\begin{table*}[t]
  \caption{Main results on LongLaMP Topic and Review Writing with $K{=}4$ support examples per user, $N{=}200$ users per setting. ROUGE-L and METEOR are reported as mean$\pm$standard deviation across users. UPH uses 128~bytes per user and zero personalized prompt tokens at inference. Best ROUGE-L among personalized methods in \textbf{bold}; second-best \underline{underlined}. Higher is better. ``Inf.~tokens'' denotes the number of additional personalization prompt tokens carried at inference.}
  \label{tab:main}
  \small
  \begin{tabularx}{\textwidth}{l r r C C C C}
    \toprule
    & \multicolumn{1}{c}{State} & \multicolumn{1}{c}{Inf.} & \multicolumn{2}{c}{\textbf{Review-user}} & \multicolumn{2}{c}{\textbf{Topic-user}} \\
    \cmidrule(lr){4-5} \cmidrule(lr){6-7}
    Method & \multicolumn{1}{c}{(bytes)} & \multicolumn{1}{c}{tokens} & ROUGE-L & METEOR & ROUGE-L & METEOR \\
    \midrule
    Base (no personalization) & 0 & 0 & .126$\pm$.028 & .155$\pm$.059 & .119$\pm$.023 & .204$\pm$.059 \\
    \midrule
    \emph{In-context learning (ICL):} \\
    Prompt-All-K & ${\sim}3$K$^*$ & ${\sim}1500$ & \underline{.142$\pm$.031} & .195$\pm$.056 & .123$\pm$.067 & .185$\pm$.084 \\
    BM25-Top1 & ${\sim}0.7$K$^*$ & ${\sim}380$ & .140$\pm$.029 & .195$\pm$.057 & .119$\pm$.068 & .186$\pm$.085 \\
    Dense-Top1 & ${\sim}0.7$K$^*$ & ${\sim}380$ & \textbf{.143$\pm$.067} & .197$\pm$.080 & .119$\pm$.068 & .183$\pm$.086 \\
    Profile-based & ${\sim}0.6$K$^*$ & ${\sim}300$ & .121$\pm$.029 & .142$\pm$.057 & .112$\pm$.028 & .193$\pm$.068 \\
    \midrule
    \emph{Parameter-efficient FT (PEFT):} \\
    LoRA ($r{=}8$) & 2.1M & 0 & .132$\pm$.029 & .197$\pm$.052 & .119$\pm$.028 & .188$\pm$.065 \\
    Tiny LoRA ($r{=}1$) & 272K & 0 & .126$\pm$.032 & .159$\pm$.056 & .118$\pm$.027 & .200$\pm$.066 \\
    VeRA ($r{=}256$) & 129K & 0 & .124$\pm$.027 & .155$\pm$.060 & .119$\pm$.025 & .202$\pm$.063 \\
    Prompt Tuning ($L{=}5$) & 15K & 0 & .005$\pm$.015 & .007$\pm$.015 & .045$\pm$.044 & .075$\pm$.075 \\
    Prefix Tuning ($L{=}5$) & 143K & 0 & .074$\pm$.047 & .051$\pm$.034 & .070$\pm$.031 & .077$\pm$.046 \\
    \midrule
    \textbf{UPH (ours, $d{=}64$)} & \textbf{128} & \textbf{0} & \underline{.140$\pm$.032} & .149$\pm$.054 & \textbf{.132$\pm$.029} & .192$\pm$.056 \\
    \bottomrule
  \end{tabularx}
  \vspace{0.3em}

  {\footnotesize $^*$ICL baselines store user text (bytes shown are approximate average byte counts of retained support text); they have no learned per-user state.}
\end{table*}

%% ============================================================
%% SECTION 2: Experimental Setup + Main Results
%% ============================================================

\section{Experimental Setup}

\paragraph{Dataset.}
We use LongLaMP, a benchmark for personalized long-form text generation.
We evaluate on two generation tasks: \textbf{Review Writing} (product reviews) and \textbf{Topic Writing} (Reddit-style topical posts).
We use the \emph{user} validation split for each task with $N{=}200$ users and $K{=}4$ support examples per user ($K{\in}\{8,16\}$ are also studied in the ablation).

\paragraph{Model.}
We use Qwen2.5-1.5B-Instruct as the frozen backbone.
All methods share the same model weights; only the personalization mechanism differs.
Decoding is greedy (argmax), with a minimum of 128 and a maximum of 512 new tokens.
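For reproducibility, this corresponds to the following Hugging Face \texttt{transformers} generation call (the specific serving stack is incidental; any greedy decoder with the same token limits behaves identically):
{\footnotesize
\begin{verbatim}
out = model.generate(**inputs,
                     do_sample=False,     # greedy argmax
                     min_new_tokens=128,
                     max_new_tokens=512)
\end{verbatim}}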

\paragraph{Baselines.}
We compare against 10 baselines spanning three paradigms:
(1)~\textbf{Base}: the unpersonalized frozen LLM with no user context;
(2)~\emph{In-context learning (ICL):} \textbf{Prompt-All-K} concatenates all $K{=}4$ support examples into the prompt; \textbf{BM25-Top1} and \textbf{Dense-Top1} retrieve the single most relevant support example via BM25 and sentence-transformers (all-MiniLM-L6-v2), respectively; \textbf{Profile-based} first uses the LLM to generate a 2--3 sentence profile of the user's writing style from the support set, and conditions generation on that profile;
(3)~\emph{PEFT:} per-user adapters are fitted to the $K$ support examples---\textbf{LoRA} ($r{=}8$), \textbf{Tiny LoRA} ($r{=}1$), and \textbf{VeRA} ($r{=}256$) on $\{q,v\}$-projections, each for 30 gradient steps, and \textbf{Prompt Tuning} / \textbf{Prefix Tuning} with 5 virtual tokens (100 steps, tuned learning rates).

\paragraph{Metrics.}
Following the LongLaMP evaluation protocol, we report ROUGE-L (longest common subsequence overlap with the reference) and METEOR (semantic adequacy).
We emphasize ROUGE-L as the primary metric; METEOR is known to be sensitive to output length, which differs substantially across methods.

\paragraph{UPH configuration.}
We use $d{=}64$, scale $\alpha{=}0.1$, KL weight $\beta{=}0.05$, $\ell_2$ weight $\lambda{=}10^{-4}$, learning rate $0.05$, and 30 Adam steps.
The entire per-user state is $d{\times}2{=}128$ bytes in bfloat16; adaptation takes ${\sim}7$\,s per user offline and can be cached.
At inference, UPH applies Equation~\eqref{eq:uph} at every decode step and blends the resulting logits 50/50 with the frozen base model's logits.
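
\paragraph{Reference sketch.}
For concreteness, the decode-time computation reduces to a few lines on top of a standard forward pass.
The following PyTorch-style sketch is illustrative only: handles such as \texttt{lm\_head} and \texttt{hidden}, and the initialization scale of $\mathbf{U}$, are schematic rather than taken from a released implementation.
{\footnotesize
\begin{verbatim}
import torch

H, d = 1536, 64           # hidden size, user-vector dim
alpha, blend = 0.1, 0.5   # Eq. (1) scale, logit blend
torch.manual_seed(0)
U = torch.randn(H, d)     # shared; fixed after init
theta = torch.zeros(d)    # per-user; 128 B in bf16

def uph_next_logits(hidden, lm_head):
    # hidden: [T, H] final hidden states of the frozen
    # backbone; lm_head: its (frozen) output projection
    biased = hidden + alpha * (U @ theta)   # Eq. (1)
    mix = blend * lm_head(biased) \
          + (1 - blend) * lm_head(hidden)
    return mix[-1]        # argmax of this = next token
\end{verbatim}}
Fitting runs the same biased forward over the cached support-set hidden states and backpropagates Equation~\eqref{eq:fit} into $\boldsymbol{\theta}_u$ alone.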

%% ============================================================
%% SECTION 3: Results
%% ============================================================

\section{Results}

\subsection{Main Results}

Table~\ref{tab:main} presents the main results at $K{=}4$ across all 10 baselines.
We observe three clear patterns.

\paragraph{UPH significantly improves ROUGE-L over Base on both tasks.}
UPH improves Review-user ROUGE-L from .1258 to .1399 ($+11.2\%$, $p{=}4.2{\times}10^{-13}$) and Topic-user ROUGE-L from .1193 to .1321 ($+10.7\%$, $p{=}1.6{\times}10^{-13}$) by paired $t$-test on per-user scores.
This is achieved with a per-user state of only 128 bytes and \emph{zero} additional prompt tokens at inference.

\paragraph{UPH matches or outperforms ICL on ROUGE-L, despite using no prompt tokens.}
On Review Writing, UPH (.1399) is statistically indistinguishable from the strongest ICL baselines on ROUGE-L (Prompt-All-K .1417, $p{=}0.49$; BM25 .1399, $p{=}0.97$; Dense .1433, $p{=}0.49$), even though ICL methods carry $\sim$380--1500 personalized prompt tokens at inference.
UPH's METEOR (.1491) remains below the ICL baselines' (.1949--.1966), which is expected: when verbatim exemplars are present in the prompt, the model can directly echo their semantic content, producing higher unigram/synonym overlap with the reference.
Crucially, on Topic Writing the pattern reverses sharply: UPH is \textbf{significantly better} than the retrieval-based ICL methods on ROUGE-L (Dense .1187, $p{=}6.5{\times}10^{-3}$; BM25 .1198, $p{=}1.0{\times}10^{-2}$) and better than Prompt-All-K (.1229) at marginal significance ($p{=}6.0{\times}10^{-2}$).
This is the regime where support--query mismatch is high (a user's past Reddit posts on one topic poorly predict their writing on a new one), and textual exemplars become \emph{negative} evidence for the model.

\paragraph{UPH substantially outperforms all PEFT methods at $K{=}4$.}
Every PEFT method in Table~\ref{tab:main}---LoRA ($2.1$M bytes of per-user state), Tiny LoRA ($272$K), VeRA ($129$K), Prompt Tuning ($15$K), Prefix Tuning ($143$K)---underperforms UPH on ROUGE-L, with differences statistically significant at $p < 10^{-2}$ or lower on both tasks.
Prompt Tuning and Prefix Tuning fail catastrophically at $K{=}4$ (ROUGE-L near 0), consistent with the view that few-shot optimization of even thousands of parameters is underdetermined when only ${\sim}1500$ target tokens of supervision are available per user.
Viewing the parameter-to-data ratio as a rough complexity index, UPH's 64-parameter state is more than three orders of magnitude less over-parameterized than rank-1 LoRA (${\sim}1.4{\times}10^{5}$ parameters) at this scale, which is exactly the regime where a correctly scaled inductive bias dominates expressiveness.

\subsection{$K$ and $d$ Ablations}

We study the sensitivity of UPH to the support-set size $K$ and the user-vector dimension $d$ to verify that our main finding is not an artifact of a particular operating point (Table~\ref{tab:ablation}).

\paragraph{$K$ scaling.}
Increasing $K$ from 4 to 16 changes UPH's Review ROUGE-L by at most $0.002$ and Topic ROUGE-L by $0.001$.
The method is neither bottlenecked nor destabilized by more support data, consistent with a simple scalar-family prior that saturates quickly.
By contrast, the ICL baselines (BM25, Dense) improve with $K$ on Review ($+0.003$ on ROUGE-L from $K{=}4$ to $K{=}16$) but remain essentially flat on Topic, reinforcing the support--query-mismatch interpretation.

\paragraph{$d$ scaling.}
Remarkably, a 16-byte per-user state ($d{=}8$) already captures most of the ROUGE-L gain on both tasks: Review .1353 and Topic .1316, versus .1399 and .1321 at $d{=}64$.
Gains saturate around $d{=}32$--$64$; $d{=}128$ does not improve over $d{=}64$ and slightly hurts on Review (.1370 vs.\ .1399).
This suggests that the user-specific information that UPH can extract from $K{=}4$ examples is genuinely low-dimensional, and that a practically deployed version of UPH could use $d{=}8$--$16$ to cut per-user storage by 4--8$\times$ with negligible quality loss.

\begin{table}[t]
  \caption{UPH ablations on ROUGE-L (mean$\pm$std across users, $N{=}200$). \textbf{Top:} $K$-scaling at $d{=}64$. \textbf{Bottom:} $d$-scaling at $K{=}4$.
Results are stable across a wide range of $K$ and $d$; $d{=}8$ (16 bytes) already recovers most of the gain.}
  \label{tab:ablation}
  \small
  \setlength{\tabcolsep}{3pt}
  \begin{tabularx}{\columnwidth}{l C C C}
    \toprule
    \multicolumn{4}{c}{$K$-scaling ($d{=}64$)} \\
    \cmidrule(lr){1-4}
    Task & $K{=}4$ & $K{=}8$ & $K{=}16$ \\
    \midrule
    Review-user & \textbf{.140$\pm$.032} & .137$\pm$.031 & .138$\pm$.032 \\
    Topic-user & .132$\pm$.029 & .131$\pm$.030 & \textbf{.133$\pm$.029} \\
    \bottomrule
  \end{tabularx}

  \vspace{0.6em}
  \begin{tabularx}{\columnwidth}{l C C C C C}
    \toprule
    \multicolumn{6}{c}{$d$-scaling ($K{=}4$)} \\
    \cmidrule(lr){1-6}
    Task & $d{=}8$ & $d{=}16$ & $d{=}32$ & $d{=}64$ & $d{=}128$ \\
    \midrule
    Review-user & .135$\pm$.033 & .137$\pm$.032 & .138$\pm$.032 & \textbf{.140$\pm$.032} & .137$\pm$.032 \\
    Topic-user & .132$\pm$.031 & .131$\pm$.030 & .132$\pm$.030 & \textbf{.132$\pm$.029} & .131$\pm$.030 \\
    \bottomrule
  \end{tabularx}
\end{table}

%% ============================================================
%% SECTION 4: Discussion and Limitations
%% ============================================================

\section{Discussion and Limitations}

\paragraph{Why so few parameters suffice.}
A recurring observation in Table~\ref{tab:main} is that methods with \emph{more} per-user parameters---LoRA, Tiny LoRA, Prompt/Prefix Tuning---do not translate capacity into quality in our few-shot regime; several underperform Base.
Our interpretation is that user-specific personalization from $K{=}4$ samples is a genuinely low-signal problem: ${\sim}1500$ target tokens contain limited orthogonal information about user style, and any adapter with thousands of trainable parameters has vastly more degrees of freedom than the data can identify.
By fixing the shared projection $\mathbf{U}$ and restricting optimization to a $d$-dimensional vector, UPH imposes a strong inductive prior that the user-specific component lives in a small random subspace of the model's hidden-state space.
The $d$-ablation corroborates this view: performance is essentially flat over the $16\times$ range from $d{=}8$ to $d{=}128$, and $d{=}8$ already nearly matches $d{=}64$.

\paragraph{When textual exemplars help, and when they don't.}
The Review / Topic split is the clearest qualitative story in our results.
On Review, past product reviews are plausibly informative about how the user will write about a new product---the task is domain-stable---and ICL methods benefit from direct verbatim signal.
On Topic, past Reddit posts on sports tell the model relatively little about how to write about cooking, and concatenating them into the prompt can displace more useful task priors.
UPH is insulated from this failure mode because it never injects raw user text; instead, it contributes a small additive bias to every decode step.
This suggests a practical deployment heuristic: the benefit of UPH over ICL increases as the diversity of per-user queries grows.

\paragraph{Limitations.}
Three limitations delineate the scope of this LBR.
(i)~We do not conduct a controlled support--query mismatch study; the Review-vs-Topic contrast is suggestive but not a directly manipulated variable.
(ii)~UPH's METEOR gap on Review indicates that UPH does not replicate the semantic coverage benefit of verbatim exemplars; combining UPH with a short retrieved exemplar is a natural hybrid that we leave to future work.
(iii)~We evaluate a single 1.5B backbone; scaling to larger models, and allowing $\boldsymbol{\theta}_u$ to evolve online as a user accumulates more history, are important directions.

\paragraph{Takeaway.}
A single 64-dimensional (or even 8-dimensional) per-user vector, fitted offline on four writing samples, yields a statistically significant and consistent improvement over a frozen LLM on two personalized generation tasks, is competitive with retrieval-based ICL on one, and beats it on the other---while requiring no inference-time prompt tokens and only negligible per-user storage.
This operating point has been largely unexplored in the personalized LLM literature, where prompt-side memory and model-side adapters dominate.
We hope these results motivate further study of compact user priors as a principled third option between text memory and PEFT.

%% ============================================================
%% Acknowledgments + References
%% ============================================================

\begin{acks}
TODO: Acknowledgments.
\end{acks}

% No references (LBR; related work integrated in-line).

\end{document}
