# CLAUDE.md
This file provides guidance to Claude Code when working with this repository.
---
## Project Overview
**Personalization User Model** is a research project for **personalized LLM assistants** that learn and adapt to individual user preferences over multi-session interactions.
### Research Goal
Build an AI assistant that:
1. **Extracts** user preferences from conversation (e.g., "I prefer bullet points", "show me step-by-step math")
2. **Stores** preferences in a retrieval-augmented memory system
3. **Retrieves** relevant preferences using dense retrieval + reranking
4. **Adapts** responses using a learned user vector (REINFORCE-based RL)
5. **Improves** over multiple sessions without explicit user configuration
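The five stages above can be sketched as a single turn loop. This is an illustrative stand-in, not the project's actual API (that lives in `src/personalization/`); all names here (`UserState`, `run_turn`) and the keyword-based extractor are assumptions for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class UserState:
    preferences: list = field(default_factory=list)           # extracted preference strings
    user_vector: list = field(default_factory=lambda: [0.0] * 4)  # learned weights (stage 4)

def extract_preferences(message: str) -> list:
    """Stage 1: trivial keyword-based extraction stand-in."""
    return [message] if "prefer" in message.lower() else []

def run_turn(state: UserState, message: str) -> str:
    # 1. extract preferences from the incoming message
    state.preferences.extend(extract_preferences(message))
    # 2-3. store + retrieve (here: trivially keep everything in context;
    #      the real system uses dense retrieval + reranking)
    retrieved = state.preferences
    # 4. adapt: a real system would score memories with the user vector
    return f"[using {len(retrieved)} preference(s)] response to: {message}"
```

Stage 5 (improvement across sessions) corresponds to `UserState` persisting between calls, with REINFORCE updating `user_vector` from feedback.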
### Key Innovation
Unlike static preference profiles, this system uses:
- **Online preference extraction** from natural conversation
- **Policy-based memory retrieval** with user-specific vectors
- **REINFORCE updates** from implicit user feedback (enforcement signals)
---
## Repository Structure
```
personalization-user-model/
├── src/personalization/ # Core personalization library
│ ├── serving/ # Main PersonalizedLLM class
│ │ └── personalized_llm.py # Primary inference interface
│ ├── models/ # LLM backends (vLLM, local)
│ │ └── llm/vllm_chat.py # vLLM HTTP client
│ ├── retrieval/ # Dense retrieval + reranking
│ ├── feedback/ # REINFORCE reward processing
│ ├── user_model/ # User vector learning
│ └── evaluation/ # Metrics and analysis
│
├── collaborativeagents/ # Experiment framework (MULTISESSIONCOLLAB)
│ ├── adapters/ # Method adapters for experiments
│ │ ├── personalized_llm_adapter.py # RAG methods
│ │ ├── contextual_adapter.py # Full-context baseline
│ │ └── reflection_adapter.py # CollaborativeAgents baseline
│ ├── agents/ # Batch processing clients
│ │ └── batch_vllm_agent.py # Async vLLM/OpenAI batching
│ ├── scripts/ # Experiment runners
│ │ └── run_experiments.py # Main experiment script
│ ├── slurm/ # HPC job scripts
│ │ └── fullscale/ # Full-scale experiment jobs
│ ├── data/ # User profiles and datasets
│ │ └── complex_profiles_v2/ # 200 user profiles (43 prefs each)
│ └── results/ # Experiment outputs
│
├── models/ # Downloaded HuggingFace models (not in git)
│ ├── llama-3.1-8b-instruct/ # Agent LLM
│ ├── qwen3-embedding-8b/ # Dense embeddings
│ └── rerankers/ # Qwen3 8B reranker
│
├── data/ # Datasets and corpora (not in git)
│ ├── corpora/ # Memory card storage
│ └── users/ # User state persistence
│
└── LLaMA-Factory/ # External fine-tuning toolkit
```
---
## Methods (Baselines)
The experiment compares **6 methods** for multi-session personalization:
| Method | Description | Memory Type |
|--------|-------------|-------------|
| `vanilla` | No memory, pure LLM | None |
| `contextual` | Full conversation history in context | In-context |
| `reflection` | Session-level reflection → agent_notes | Summarized notes |
| `all_memory` | All extracted preferences in context | All memories |
| `rag` | Dense retrieval + reranking (no user vector) | Retrieved top-k |
| `rag_vector` | RAG + learned user vector (proposed) | Retrieved + personalized |
---
## Setup
### 1. Environment
```bash
cd /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model
source /u/yurenh2/miniforge3/etc/profile.d/conda.sh
conda activate eval
export PYTHONPATH="${PWD}/src:${PWD}/collaborativeagents:${PYTHONPATH}"
export HF_HOME=/projects/bfqt/users/yurenh2/hf_cache/huggingface
```
### 2. Environment Variables
Create `.env` file with:
```bash
OPENAI_API_KEY=sk-... # For GPT user simulator
HF_TOKEN=hf_... # For gated models
```
### 3. Models Required
- **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
- **User Simulator**: 70B model via vLLM or OpenAI API (gpt-5-mini)
- **Reward Model**: `models/llama-3.1-8b-instruct/` for local reward (optional, same as agent)
- **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
- **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)
### 4. Storage Locations
| Path | Quota | Usage |
|------|-------|-------|
| `/projects/bfqt` | 500GB (soft) | Code, models, results |
| `/work/hdd/bfqt` | 1TB | Overflow storage, large checkpoints |
| `/work/nvme/bfqt` | 500GB | Fast scratch (temporary) |
---
## Running Experiments
### Quick Test (2 profiles, 2 sessions)
```bash
cd collaborativeagents/scripts
python run_experiments.py \
--methods vanilla \
--datasets math-hard \
--n-profiles 2 \
--n-sessions 2 \
--max-turns 8 \
--use-vllm \
--vllm-agent-url http://localhost:8003/v1 \
--parallel-profiles 2 \
--profile-path ../data/complex_profiles_v2/profiles_200.jsonl \
--output-dir ../results/test
```
### Full-Scale Experiment
```bash
# GPU Layout (4x A100 80GB):
# GPU 0-1: 70B user simulator (TP=2)
# GPU 2: 8B agent
# GPU 3: Embedding + Reranker
cd collaborativeagents/slurm/fullscale
sbatch test_local_user.sh
```
### Key Arguments
| Argument | Description |
|----------|-------------|
| `--methods` | Comma-separated: vanilla,contextual,reflection,all_memory,rag,rag_vector |
| `--n-profiles` | Number of user profiles (max 200) |
| `--n-sessions` | Sessions per profile |
| `--max-turns` | Max turns per session |
| `--use-vllm` | Use vLLM for agent (required for batching) |
| `--use-openai-user` | Use OpenAI API for user simulator |
| `--vllm-user-url` | Local vLLM user simulator URL |
| `--parallel-profiles` | Batch size for turn-synchronous processing |
| `--reward-mode` | `keyword`, `llm` (GPT-4o-mini), or `llm_local` (local vLLM) |
| `--reward-vllm-url` | vLLM URL for local reward model (when `--reward-mode=llm_local`) |
---
## Current Results
### Completed Experiments
Located in `collaborativeagents/results/`:
| Experiment | Profiles | Sessions | Methods | Status |
|------------|----------|----------|---------|--------|
| `rag_vector_v2_*` | 10 | 10 | rag_vector | Complete |
| `gpt_user_all_methods_*` | 5-10 | 2-5 | all 6 | Partial |
| `test_50parallel_*` | 50 | 1 | vanilla | Test only |
### Throughput Benchmarks
| Setup | Throughput | Notes |
|-------|------------|-------|
| OpenAI user + vLLM agent | ~60 sessions/hr | API latency bottleneck |
| Local 70B user + 8B agent | ~2000+ sessions/hr | Expected (not yet tested) |
---
## Reward Models
The system supports three reward modes for RL updates:
| Mode | Model | Pros | Cons |
|------|-------|------|------|
| `keyword` | Heuristic | Fast, no API/GPU | Less accurate |
| `llm` | GPT-4o-mini (API) | High accuracy | API costs, latency |
| `llm_local` | Llama-3.1-8B (vLLM) | High accuracy, no API costs, batch processing | Requires GPU |
### Local Reward Model Setup
The local reward model uses the same classification prompt as GPT-4o-mini but runs on a local vLLM server.
**Test Results** (Llama-3.1-8B vs GPT-4o-mini):
- Accuracy: 83-92% (both models)
- Agreement: 100%
- Throughput: 3.6 samples/sec (batched)
**Files**:
- `src/personalization/feedback/local_llm_reward.py` - `LocalLLMRewardClient` with batch support
- `scripts/test_local_reward_batch.py` - Test script
**Usage**:
```bash
# Start reward model vLLM server (GPU 3)
python -m vllm.entrypoints.openai.api_server \
--model models/llama-3.1-8b-instruct \
--port 8005 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 4096
# Run experiment with local reward
python run_experiments.py \
--reward-mode llm_local \
--reward-vllm-url http://localhost:8005/v1 \
...
```
**Classification Labels**:
- `neg_constraint_restate`: User reasserts constraints (reward: -1.0)
- `neg_correction`: User indicates content is wrong (reward: -0.8)
- `neg_confusion`: User indicates confusion (reward: -0.6)
- `pos_praise`: Explicit praise (reward: +0.8)
- `pos_progress`: Constructive continuation (reward: +0.1)
- `neutral`: Ambiguous feedback (reward: 0.0)
- `topic_shift`: New topic (reward: 0.0, no update)
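The label-to-reward mapping above can be written as a lookup table. The labels and values come straight from this document; the function name is illustrative.

```python
# Reward values per classification label (from the list above).
REWARDS = {
    "neg_constraint_restate": -1.0,  # user reasserts constraints
    "neg_correction": -0.8,          # user indicates content is wrong
    "neg_confusion": -0.6,           # user indicates confusion
    "pos_praise": 0.8,               # explicit praise
    "pos_progress": 0.1,             # constructive continuation
    "neutral": 0.0,                  # ambiguous feedback
    "topic_shift": 0.0,              # new topic; additionally skip the RL update
}

def reward_for(label: str) -> float:
    """Map a classifier label to its scalar reward; unknown labels are neutral."""
    return REWARDS.get(label, 0.0)
```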
---
## Experiments To Be Done
### 1. Full-Scale Benchmark (Priority: HIGH)
**Goal**: 200 profiles × 6 methods × 15 sessions = 18,000 sessions
**Setup**:
- User simulator: Local 70B (vLLM, TP=2) - NOT OpenAI (too slow)
- Agent: 8B LLaMA (vLLM)
- Reward: Local 8B LLM judge (vLLM) - faster than API, same accuracy
**GPU Layout** (4x A100 80GB):
```
GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
GPU 2: 8B agent
GPU 3: 8B reward model + Embedding + Reranker (for RAG methods)
```
**Note**: The reward model (Llama-3.1-8B) shares GPU 3 with embedding/reranker models. Use `--reward-mode llm_local --reward-vllm-url http://localhost:8005/v1` to enable.
**Jobs**: Split by method and profile range (50 profiles each)
```
collaborativeagents/slurm/fullscale/
├── run_vanilla_p{0,50,100,150}.sh
├── run_contextual_p{0,50,100,150}.sh
├── run_reflection_p{0,50,100,150}.sh
├── run_all_memory_p{0,50,100,150}.sh
├── run_rag_p{0,50,100,150}.sh
├── run_rag_vector_p{0,50,100,150}.sh
└── submit_all.sh
```
### 2. Session Extension (Priority: MEDIUM)
If 15 sessions prove insufficient, continue to 30 sessions from a checkpoint:
```bash
python run_experiments.py \
--n-sessions 30 \
--continue-from ../results/fullscale_15sess/...
```
### 3. Ablation Studies (Priority: LOW)
- RAG with BGE reranker (278M) vs Qwen3 (8B)
- Best-of-N sampling (N=3) for RAG methods
- Different embedding models
---
## Key Files Reference
### Core Personalization
| File | Purpose |
|------|---------|
| `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
| `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
| `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge (API) |
| `src/personalization/feedback/local_llm_reward.py` | Local vLLM-based reward judge (batch processing) |
### Experiment Framework
| File | Purpose |
|------|---------|
| `collaborativeagents/scripts/run_experiments.py` | Main experiment runner with batch processing |
| `collaborativeagents/adapters/*.py` | Method-specific adapters with `prepare_prompt()`, `process_response()` |
| `collaborativeagents/agents/batch_vllm_agent.py` | `BatchVLLMClient` and `BatchOpenAIClient` for async batching |
### Data
| File | Purpose |
|------|---------|
| `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl` | 200 user profiles with 43 preferences each |
| `data/corpora/empty_store/` | Empty memory store for fresh experiments |
---
## Troubleshooting
### Quota Exceeded
```bash
# Check quota
quota -s
# Move large files to HDD storage
mv /projects/bfqt/users/yurenh2/large_dir /work/hdd/bfqt/users/yurenh2/
```
### vLLM Server Issues
```bash
# Check if server is running
curl http://localhost:8003/health
# Kill existing servers
pkill -f "vllm.entrypoints"
```
### Out of GPU Memory
- Reduce `--gpu-memory-utilization` (default 0.90)
- Reduce `--max-model-len` (default 8192)
- Use quantized models (AWQ INT4)
### Slow Throughput
- Use local vLLM user simulator instead of OpenAI API
- Increase `--parallel-profiles` for better batching
- Check vLLM logs for "Running: N reqs" to verify batching
---
## Code Conventions
1. **Batch Processing**: All adapters must implement `prepare_prompt()` and `process_response()` for batched vLLM calls
2. **Device Assignment**: GPUs 0-1 for large models, GPU 2 for agent, GPU 3 for embedding/reranker
3. **Checkpoints**: Session-level tracking in `checkpoint.json` with `sessions_per_profile` dict
4. **Results**: JSON format in `results.json` with metrics per session
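The adapter contract from convention 1 can be sketched as a base class. The base-class name and exact signatures here are assumptions; the real adapters live in `collaborativeagents/adapters/`.

```python
class BaseAdapter:
    """Contract every method adapter must satisfy for batched vLLM calls."""

    def prepare_prompt(self, user_message: str, history: list) -> list:
        """Return the chat messages for one request in a batch."""
        raise NotImplementedError

    def process_response(self, response: str) -> str:
        """Post-process the model output (e.g., update memory, strip markup)."""
        raise NotImplementedError

class VanillaAdapter(BaseAdapter):
    """No-memory baseline: forward the raw conversation unchanged."""

    def prepare_prompt(self, user_message, history):
        return history + [{"role": "user", "content": user_message}]

    def process_response(self, response):
        return response
```

Because every adapter exposes the same two hooks, the runner can collect `prepare_prompt()` outputs across profiles into one batched request and fan the responses back out through `process_response()`.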
---
## Contact
For questions about this codebase, refer to the experiment plan at:
`/u/yurenh2/.claude/plans/effervescent-mapping-ocean.md`
---
## Future Improvements (To Try)
### Retrieval Quality Improvements
**Problem Identified**: Current retrieval uses the raw user message as the query (e.g., "shortest palindrome"), but this does not match well against preference descriptions (e.g., "break down code with explanations"). The reranker matches task content, not preference applicability.
**Proposed Solutions**:
#### 1. Query Transformation
Instead of using raw user message as retrieval query, construct preference-oriented queries:
- Option A: Use LLM to generate "what user preferences might apply to this task?"
- Option B: Append task-type keywords to query (e.g., "code explanation preferences for: shortest palindrome")
- Option C: Multi-query retrieval - one for task content, one for task type/category
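Option B can be sketched in a few lines: prepend task-type keywords so the reranker sees preference-relevant terms rather than only task content. The keyword-based task classifier here is a trivial stand-in (an assumption, not the project's classifier).

```python
def classify_task(message: str) -> str:
    """Toy task-type classifier; a real one might be an LLM call (Option A)."""
    if any(k in message.lower() for k in ("code", "function", "palindrome", "bug")):
        return "code explanation"
    return "general"

def build_pref_query(message: str) -> str:
    """Option B: append task-type keywords to steer retrieval toward preferences."""
    return f"{classify_task(message)} preferences for: {message}"
```

Option C would issue both the raw message and `build_pref_query(message)` as separate queries and merge the retrieved sets.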
#### 2. Global vs Conditional Preferences
Separate preferences into two tiers:
- **Global preferences**: High-frequency, always-applicable (e.g., "always use numbered steps", "use Python for code")
- Always include in context, no retrieval needed
- Identify via frequency analysis or explicit "When general" condition
- **Conditional preferences**: Context-specific (e.g., "when debugging, focus on specific issue")
- Only these need retrieval based on task context
- Reduces retrieval burden and ensures universal preferences are never missed
**Implementation Notes**:
- Can be tested as ablation after current experiments complete
- Evaluate by: enforcement rate reduction, retrieval recall of actually-enforced preferences
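The frequency-analysis route for the two-tier split above can be sketched as follows. The function name, input shape (`enforcement_counts` mapping preference text to the number of sessions in which it was enforced), and the 0.5 threshold are all assumptions to be tuned in the ablation.

```python
def split_preferences(enforcement_counts: dict, n_sessions: int,
                      threshold: float = 0.5) -> tuple:
    """Split preferences into (global, conditional) tiers by enforcement rate.

    Preferences enforced in >= threshold of sessions are treated as global
    (always in context); the rest stay conditional (retrieved on demand).
    """
    global_prefs, conditional_prefs = [], []
    for pref, count in enforcement_counts.items():
        if count / n_sessions >= threshold:
            global_prefs.append(pref)
        else:
            conditional_prefs.append(pref)
    return global_prefs, conditional_prefs
```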
---
## RAG Improvement Ideas
See [docs/rag_improvement_ideas.md](docs/rag_improvement_ideas.md) for detailed brainstorming on how to improve RAG retrieval quality and reduce timeout rate.