# VARS: Vector-Adapted Retrieval Scoring for Personalized LLM Assistants

> **User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction**
>
> Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
>
> University of Illinois at Urbana-Champaign
>
> [[Paper]](paper/paper.pdf)

## Overview

Large language models are increasingly used as conversational collaborators, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. **VARS** (Vector-Adapted Retrieval Scoring) is a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar feedback, enabling personalization without per-user fine-tuning.

### Key Idea

At each turn, the system:
1. **Extracts** structured (condition, action) preferences from dialogue via a lightweight fine-tuned model
2. **Stores** preferences as memory cards with dense embeddings in a FAISS index
3. **Retrieves** relevant preferences via dense search + cross-encoder reranking, biased by a user-specific vector bonus
4. **Updates** dual user vectors (long-term + short-term) online via REINFORCE from keyword-based reward signals

The effective user vector at turn *t* combines stable cross-session identity with transient within-session context:

```
z_eff = β_L · z_long + β_S · z_short
```
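As a rough illustration of this combination, the sketch below computes `z_eff` element-wise. The function name `effective_user_vector`, the default weights, and the toy 4-dimensional vectors are assumptions made for the example, not values from the repository; plain Python lists are used to keep it dependency-free, whereas a real implementation would likely use NumPy or PyTorch tensors.

```python
from typing import Sequence


def effective_user_vector(
    z_long: Sequence[float],
    z_short: Sequence[float],
    beta_long: float = 1.0,   # illustrative weight, not the paper's setting
    beta_short: float = 0.5,  # illustrative weight, not the paper's setting
) -> list[float]:
    """Return z_eff = beta_long * z_long + beta_short * z_short, element-wise."""
    if len(z_long) != len(z_short):
        raise ValueError("both vectors must live in the same preference space")
    return [beta_long * zl + beta_short * zs for zl, zs in zip(z_long, z_short)]


# Toy vectors: a stable cross-session profile plus a within-session adjustment.
z_eff = effective_user_vector([0.2, -0.1, 0.0, 0.4], [0.0, 0.3, -0.2, 0.0])
```

Setting `beta_short` to zero in this sketch recovers a purely long-term user representation, which is one way to see how the two weights trade off stable identity against session context.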

## Results

Evaluated on [MultiSessionCollab](https://github.com/shmehri/MultiSessionCollab) (60 profiles × 60 sessions, 3,600 sessions per method) across math and code tasks:

| Method | Success (%) ↑ | Timeout (%) ↓ | User tokens ↓ |
|--------|:---:|:---:|:---:|
| Vanilla | 54.3 | 29.2 | 232.9 |
| Contextual | 52.4 | 31.4 | 213.7 |
| All-memory | 50.9 | 33.4 | 226.8 |
| Reflection | 54.4 | 28.8 | 207.5 |
| RAG | 52.0 | 44.3 | **188.4** |
| **VARS** | **55.2** | **26.4** | 193.6 |

VARS achieves the strongest overall performance: it matches Reflection in task success (55.2% vs. 54.4%) while significantly reducing the timeout rate (-2.4 pp, *p*=0.046) and user effort (-13.9 tokens, *p*=0.021) relative to Reflection. The learned long-term vectors align with cross-user preference overlap (*p*=0.006), while short-term vectors capture session-specific adaptation.

## Architecture

```
User Query u_t ──► Dense Retrieval ──► Reranker ──► User-Aware Scoring ──► Top-J notes ──► LLM Response
                        │                               s(u,m;U) = s_0 + ⟨z_eff, v_m⟩
                        │                                      ▲
                  Preference                              User Vector
                  Memory (FAISS)                     z_eff = β_L·z_L + β_S·z_S
                        ▲                                      ▲
                        │                              REINFORCE Update
                  Preference Extractor                 from keyword reward r̂_t
                  M_ext (Qwen3-0.6B)
```
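The user-aware scoring step in the diagram, `s(u,m;U) = s_0 + ⟨z_eff, v_m⟩`, can be sketched as follows. This is a minimal illustration, not the code in `retrieval/pipeline.py`: the function name `user_aware_top_j`, the 2-dimensional item vectors, and the toy scores are all assumptions, and the base scores stand in for whatever the reranker returns.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def user_aware_top_j(
    base_scores: Sequence[float],             # s_0: reranker score per candidate card
    item_vectors: Sequence[Sequence[float]],  # v_m: card vector in the preference space
    z_eff: Sequence[float],                   # effective user vector
    j: int = 2,
) -> list[tuple[int, float]]:
    """Score each card as s_0 + <z_eff, v_m>; return the top-J (index, score) pairs."""
    scored = [
        (idx, s0 + dot(z_eff, v_m))
        for idx, (s0, v_m) in enumerate(zip(base_scores, item_vectors))
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:j]


# Toy example: the reranker alone would prefer cards 0 and 1.
base_scores = [0.9, 0.8, 0.3]
item_vectors = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0]]
z_eff = [0.5, -0.5]
top = user_aware_top_j(base_scores, item_vectors, z_eff, j=2)
```

In this toy case the user bonus demotes card 0 and promotes card 2, so the selected notes are cards 1 and 2. With `z_eff` set to all zeros, the ranking falls back to the reranker scores alone.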

### Core Modules

| Module | Description |
|--------|-------------|
| `serving/personalized_llm.py` | Main inference interface (`chat()`, `chat_prepare()`, `chat_complete()`) |
| `models/llm/vllm_chat.py` | vLLM HTTP client for high-throughput batched inference |
| `models/embedding/qwen3_8b.py` | Dense embedding (Qwen3-Embedding-8B) |
| `models/reranker/qwen3_reranker.py` | Cross-encoder reranking (Qwen3-Reranker-8B) |
| `models/preference_extractor/` | Lightweight preference extraction from conversation |
| `retrieval/pipeline.py` | RAG retrieval pipeline with FAISS vector store and PCA item space |
| `user_model/policy/reinforce.py` | REINFORCE policy for dual user-vector optimization |
| `feedback/reward_model.py` | Keyword-based reward heuristic |
| `feedback/handlers.py` | Retrieval-attribution gating and online RL updates |
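
To make the feedback loop more concrete, here is a hedged sketch of a keyword reward feeding a REINFORCE-style update. It assumes a softmax policy over candidate cards with scores `s_0 + ⟨z_eff, v_m⟩`; the cue lists, function names, and learning rates are hypothetical, the `β` weights are folded into the two step sizes, and retrieval-attribution gating is omitted. The actual logic in `feedback/reward_model.py` and `user_model/policy/reinforce.py` may differ.

```python
import math
from typing import Sequence

# Hypothetical cue lists, for illustration only.
POSITIVE_CUES = ("thanks", "perfect", "exactly what i")
NEGATIVE_CUES = ("as i said", "i already told you", "not what i asked")


def keyword_reward(user_message: str) -> float:
    """Map a user turn to a weak scalar reward in {-1.0, 0.0, +1.0}."""
    text = user_message.lower()
    if any(cue in text for cue in NEGATIVE_CUES):
        return -1.0
    if any(cue in text for cue in POSITIVE_CUES):
        return 1.0
    return 0.0


def softmax(scores: Sequence[float]) -> list[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_policy_grad(z_eff, base_scores, item_vectors, chosen: int) -> list[float]:
    """Gradient of log pi(chosen) w.r.t. z_eff for pi = softmax(s_0 + <z_eff, v_m>).

    For a softmax policy this equals v_chosen - sum_m pi(m) * v_m.
    """
    scores = [
        s0 + sum(z * v for z, v in zip(z_eff, v_m))
        for s0, v_m in zip(base_scores, item_vectors)
    ]
    probs = softmax(scores)
    dim = len(z_eff)
    expected = [sum(p * v[d] for p, v in zip(probs, item_vectors)) for d in range(dim)]
    return [item_vectors[chosen][d] - expected[d] for d in range(dim)]


def apply_update(z, grad, reward, lr, baseline=0.0) -> list[float]:
    """REINFORCE step: z <- z + lr * (reward - baseline) * grad."""
    return [zi + lr * (reward - baseline) * gi for zi, gi in zip(z, grad)]


# Toy turn: card 0 was retrieved and the user reacted positively.
item_vectors = [[1.0, 0.0], [0.0, 1.0]]
base_scores = [0.0, 0.0]
z_long, z_short = [0.0, 0.0], [0.0, 0.0]
z_eff = [zl + zs for zl, zs in zip(z_long, z_short)]

reward = keyword_reward("Perfect, thanks!")
grad = log_policy_grad(z_eff, base_scores, item_vectors, chosen=0)
z_long = apply_update(z_long, grad, reward, lr=0.05)   # slow, cross-session
z_short = apply_update(z_short, grad, reward, lr=0.5)  # fast, within-session
```

Using a smaller step size for the long-term vector than for the short-term one is just one plausible way to separate slow and fast adaptation; it is an assumption of this sketch rather than a documented setting.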

## Models and Data

### Preference Extractor

A 0.6B-parameter Qwen3 model fine-tuned for structured preference extraction. Given a dialogue window, it outputs JSON preference tuples `{condition, action, confidence}`.

- **Model**: [blackhao0426/pref-extractor-qwen3-0.6b-full-sft](https://huggingface.co/blackhao0426/pref-extractor-qwen3-0.6b-full-sft)
- **Training Data**: [blackhao0426/user-preference-564k](https://huggingface.co/datasets/blackhao0426/user-preference-564k) — 564K examples constructed from public chat logs (LMSYS-Chat, WildChat), instruction-tuning corpora (Alpaca, SlimOrca), and GPT-5.1-labeled preference JSON

On a held-out set, the extractor achieves 99.7% JSON validity and 97.5% recall at 37.7% precision. The operating point is intentionally high-recall: the downstream reranker and user vector filter out irrelevant cards.
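
Because the extractor's JSON is not guaranteed to be valid, downstream code may want to parse it defensively. The sketch below assumes the output is a JSON list of objects with `condition`, `action`, and `confidence` keys; that exact layout, the `Preference` dataclass, and the `parse_preferences` helper are assumptions for illustration and are not taken from the repository.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Preference:
    condition: str
    action: str
    confidence: float


def parse_preferences(raw: str, min_confidence: float = 0.0) -> list[Preference]:
    """Parse extractor output; return [] if the JSON is invalid or not a list."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, list):
        return []

    prefs: list[Preference] = []
    for item in payload:
        if not isinstance(item, dict):
            continue  # skip entries that are not JSON objects
        try:
            pref = Preference(
                condition=str(item["condition"]),
                action=str(item["action"]),
                confidence=float(item["confidence"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip entries with missing or non-numeric fields
        if pref.confidence >= min_confidence:
            prefs.append(pref)
    return prefs


raw_output = (
    '[{"condition": "writing Python", "action": "add type hints", "confidence": 0.9},'
    ' {"condition": "explaining math", "action": "show each step"}]'
)
prefs = parse_preferences(raw_output)
```

Here the second entry is dropped because it lacks a `confidence` field. Keeping `min_confidence` low is in the spirit of the high-recall design, since later stages are expected to filter irrelevant cards.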

### Backbone Models

| Role | Model | Parameters |
|------|-------|------------|
| Agent LLM | Llama-3.1-8B-Instruct (via vLLM) | 8B |
| User Simulator | Llama-3.3-70B-Instruct (via vLLM) | 70B |
| Dense Embedding | Qwen3-Embedding-8B | 8B |
| Reranker | Qwen3-Reranker-8B | 8B |
| Preference Extractor | Qwen3-0.6B (fine-tuned) | 0.6B |

All backbone components are kept frozen; online adaptation occurs only through the per-user vectors.

## Installation

```bash
pip install -e .
```

### Requirements

- Python >= 3.10
- PyTorch >= 2.3.0
- Transformers >= 4.44.0

## Quick Start

```python
from personalization.serving import PersonalizedLLM

llm = PersonalizedLLM.from_config("configs/local_models.yaml")

# Multi-turn personalized chat
response = llm.chat(user_id="user_001", query="Explain quicksort")

# The system automatically:
# 1. Extracts preferences from conversation history
# 2. Retrieves relevant preferences via dense retrieval + reranking
# 3. Adds user-vector bonus to retrieval scores
# 4. Augments the LLM prompt with top-ranked preference notes
# 5. Updates user vectors from implicit feedback (REINFORCE)
```

## Citation

```bibtex
@article{hao2025vars,
  title={User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction},
  author={Hao, Yuren and Mehri, Shuhaib and Zhai, ChengXiang and Hakkani-T{\"u}r, Dilek},
  year={2025}
}
```

## License

Apache-2.0