Add paper and update README with full method description and results

author: YurenHao0426 <Blackhao0426@gmail.com> 2026-03-18 18:48:39 -0500
committer: YurenHao0426 <Blackhao0426@gmail.com> 2026-03-18 18:48:39 -0500
commit: f5b8fe7275fd5c0b41d1e50034424865fe906564
tree: 72504aad3ec022e5076aeef595d90296c1fde2ba
parent: 2274de15fcffb11d302ddcac9fabe5fc0e26ed47
2 files changed, 84 insertions, 28 deletions
diff --git a/README.md b/README.md
index 74fd9e5..534e390 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,61 @@
-# VARS: Vector-Augmented Retrieval System for Personalized LLM Assistants
+# VARS: Vector-Adapted Retrieval Scoring for Personalized LLM Assistants
 
-VARS is a personalization framework that enables LLM assistants to learn and adapt to individual user preferences over multi-session interactions. It combines **dense retrieval**, **reranking**, and **REINFORCE-based user vector learning** to deliver personalized responses without explicit user configuration.
+> **User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction**
+>
+> Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
+>
+> University of Illinois at Urbana-Champaign
+>
+> [[Paper]](paper/paper.pdf)
+
+## Overview
+
+Large language models are increasingly used as conversational collaborators, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. **VARS** (Vector-Adapted Retrieval Scoring) is a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar feedback, enabling personalization without per-user fine-tuning.
+
+### Key Idea
+
+At each turn, the system:
+1. **Extracts** structured (condition, action) preferences from dialogue via a lightweight finetuned model
+2. **Stores** preferences as memory cards with dense embeddings in a FAISS index
+3. **Retrieves** relevant preferences via dense search + cross-encoder reranking, biased by a user-specific vector bonus
+4. **Updates** dual user vectors (long-term + short-term) online via REINFORCE from keyword-based reward signals
+
+The effective user vector at turn *t* combines stable cross-session identity with transient within-session context:
+
+```
+z_eff = β_L · z_long + β_S · z_short
+```
+
+## Results
+
+Evaluated on [MultiSessionCollab](https://github.com/shmehri/MultiSessionCollab) (60 profiles × 60 sessions, 3,600 sessions per method) across math and code tasks:
+
+| Method | Success (%) ↑ | Timeout (%) ↓ | User tokens ↓ |
+|--------|:---:|:---:|:---:|
+| Vanilla | 54.3 | 29.2 | 232.9 |
+| Contextual | 52.4 | 31.4 | 213.7 |
+| All-memory | 50.9 | 33.4 | 226.8 |
+| Reflection | 54.4 | 28.8 | 207.5 |
+| RAG | 52.0 | 44.3 | **188.4** |
+| **VARS** | **55.2** | **26.4** | 193.6 |
+
+VARS achieves the strongest overall performance: it matches Reflection in task success while significantly reducing the timeout rate (-2.4 pp, *p*=0.046) and user effort (-13.9 tokens, *p*=0.021) relative to Reflection. The learned long-term vectors align with cross-user preference overlap (*p*=0.006), while short-term vectors capture session-specific adaptation.
 
 ## Architecture
 
 ```
-User Query ──► Preference Retrieval (Dense + Rerank) ──► Augmented Prompt ──► LLM Response
-                       ▲                                                          │
-                       │                                                          ▼
-                 User Vector ◄──── REINFORCE Update ◄──── Implicit Feedback Signal
-                       ▲
-                       │
-              Preference Extractor ◄──── Conversation History
+User Query u_t ──► Dense Retrieval ──► Reranker ──► User-Aware Scoring ──► Top-J notes ──► LLM Response
+                        │                               s(u,m;U) = s_0 + ⟨z_eff, v_m⟩
+                        │                                      ▲
+                  Preference                              User Vector
+                  Memory (FAISS)                     z_eff = β_L·z_L + β_S·z_S
+                        ▲                                      ▲
+                        │                              REINFORCE Update
+                  Preference Extractor                 from keyword reward r̂_t
+                  M_ext (Qwen3-0.6B)
 ```
 
-### Core Components
+### Core Modules
 
 | Module | Description |
 |--------|-------------|
@@ -22,30 +63,34 @@ User Query ──► Preference Retrieval (Dense + Rerank) ──► Augmented P
 | `models/llm/vllm_chat.py` | vLLM HTTP client for high-throughput batched inference |
 | `models/embedding/qwen3_8b.py` | Dense embedding (Qwen3-Embedding-8B) |
 | `models/reranker/qwen3_reranker.py` | Cross-encoder reranking (Qwen3-Reranker-8B) |
-| `models/preference_extractor/` | Online preference extraction from conversation |
-| `retrieval/pipeline.py` | RAG retrieval pipeline with FAISS vector store |
-| `user_model/policy/reinforce.py` | REINFORCE policy for user vector optimization |
-| `feedback/` | Reward model (keyword / LLM judge) and online RL updates |
+| `models/preference_extractor/` | Lightweight preference extraction from conversation |
+| `retrieval/pipeline.py` | RAG retrieval pipeline with FAISS vector store and PCA item space |
+| `user_model/policy/reinforce.py` | REINFORCE policy for dual user-vector optimization |
+| `feedback/reward_model.py` | Keyword-based reward heuristic |
+| `feedback/handlers.py` | Retrieval-attribution gating and online RL updates |
 
-## Models
+## Models and Data
 
 ### Preference Extractor
 
-We fine-tuned a Qwen3-0.6B model for structured preference extraction from conversational context.
+A 0.6B-parameter Qwen3 model finetuned for structured preference extraction. Given a dialogue window, it outputs JSON preference tuples `{condition, action, confidence}`.
 
 - **Model**: [blackhao0426/pref-extractor-qwen3-0.6b-full-sft](https://huggingface.co/blackhao0426/pref-extractor-qwen3-0.6b-full-sft)
-- **Training Data**: [blackhao0426/user-preference-564k](https://huggingface.co/datasets/blackhao0426/user-preference-564k) (564K examples of user preference extraction)
+- **Training Data**: [blackhao0426/user-preference-564k](https://huggingface.co/datasets/blackhao0426/user-preference-564k) — 564K examples constructed from public chat logs (LMSYS-Chat, WildChat), instruction-tuning corpora (Alpaca, SlimOrca), and GPT-5.1-labeled preference JSON
 
-The extractor takes conversation turns as input and outputs structured `{condition, action, confidence}` preference tuples.
+On a held-out set, the extractor achieves 99.7% JSON validity and 97.5% recall at 37.7% precision. The operating point is intentionally high-recall, because the downstream reranker and user vector filter out irrelevant cards.
 
-### Other Models Used
+### Backbone Models
 
-| Role | Model |
-|------|-------|
-| Agent LLM | LLaMA-3.1-8B-Instruct (via vLLM) |
-| Dense Embedding | Qwen3-Embedding-8B |
-| Reranker | Qwen3-Reranker-8B |
-| Reward Judge | LLaMA-3.1-8B-Instruct or GPT-4o-mini |
+| Role | Model | Parameters |
+|------|-------|------------|
+| Agent LLM | Llama-3.1-8B-Instruct (via vLLM) | 8B |
+| User Simulator | Llama-3.3-70B-Instruct (via vLLM) | 70B |
+| Dense Embedding | Qwen3-Embedding-8B | 8B |
+| Reranker | Qwen3-Reranker-8B | 8B |
+| Preference Extractor | Qwen3-0.6B (finetuned) | 0.6B |
+
+All backbone components are kept frozen; online adaptation occurs only through the per-user vectors.
 
 ## Installation
 
@@ -59,7 +104,7 @@ pip install -e .
 - PyTorch >= 2.3.0
 - Transformers >= 4.44.0
 
-## Usage
+## Quick Start
 
 ```python
 from personalization.serving import PersonalizedLLM
@@ -72,8 +117,19 @@ response = llm.chat(user_id="user_001", query="Explain quicksort")
 # The system automatically:
 # 1. Extracts preferences from conversation history
 # 2. Retrieves relevant preferences via dense retrieval + reranking
-# 3. Augments the prompt with personalized context
-# 4. Updates user vector from implicit feedback (REINFORCE)
+# 3. Adds user-vector bonus to retrieval scores
+# 4. Augments the LLM prompt with top-ranked preference notes
+# 5. Updates user vectors from implicit feedback (REINFORCE)
+```
+
+## Citation
+
+```bibtex
+@article{hao2025vars,
+  title={User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction},
+  author={Hao, Yuren and Mehri, Shuhaib and Zhai, ChengXiang and Hakkani-T{\"u}r, Dilek},
+  year={2025}
+}
 ```
 
 ## License
diff --git a/paper/paper.pdf b/paper/paper.pdf
new file mode 100644
index 0000000..611e99a
--- /dev/null
+++ b/paper/paper.pdf
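
The user-aware scoring rule that the README hunk above introduces, `s(u,m;U) = s_0 + ⟨z_eff, v_m⟩` with `z_eff = β_L·z_L + β_S·z_S`, can be illustrated with a minimal sketch. The function names, the β weights, and the toy vectors below are assumptions made for illustration only; they are not taken from `retrieval/pipeline.py` or any other file in the repository, and the real implementation may differ.

```python
from typing import List, Sequence


def effective_user_vector(
    z_long: Sequence[float],
    z_short: Sequence[float],
    beta_long: float,
    beta_short: float,
) -> List[float]:
    """Combine long-term and short-term vectors: z_eff = β_L·z_L + β_S·z_S."""
    return [beta_long * zl + beta_short * zs for zl, zs in zip(z_long, z_short)]


def user_aware_score(
    base_score: float, z_eff: Sequence[float], item_vec: Sequence[float]
) -> float:
    """Add the user-vector bonus to a reranker score: s = s_0 + <z_eff, v_m>."""
    return base_score + sum(z * v for z, v in zip(z_eff, item_vec))


def rank_notes(
    base_scores: Sequence[float],
    item_vecs: Sequence[Sequence[float]],
    z_eff: Sequence[float],
    top_j: int,
) -> List[int]:
    """Return indices of the top-J memory cards under the user-aware score."""
    scored = [
        (user_aware_score(s0, z_eff, v), idx)
        for idx, (s0, v) in enumerate(zip(base_scores, item_vecs))
    ]
    # Sort by descending score; break ties by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_j]]


# Toy example (hypothetical values): card 1 has a lower reranker score than
# card 0 but aligns with the user vector, so the bonus moves it to the top.
z_eff = effective_user_vector([1.0, 0.0], [0.0, 1.0], beta_long=0.5, beta_short=0.5)
top = rank_notes(
    base_scores=[0.9, 0.8, 0.1],
    item_vecs=[[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    z_eff=z_eff,
    top_j=2,
)
print(top)  # [1, 0]
```

The sketch shows only the additive bonus and the top-J selection. It leaves out the dense retrieval, cross-encoder reranking, and REINFORCE update stages described in the README.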