# CLAUDE.md
This file provides guidance to Claude Code when working with this repository.
---
## Project Overview
**Personalization User Model** is a research project for **personalized LLM assistants** that learn and adapt to individual user preferences over multi-session interactions.
### Research Goal
Build an AI assistant that:
1. **Extracts** user preferences from conversation (e.g., "I prefer bullet points", "show me step-by-step math")
2. **Stores** preferences in a retrieval-augmented memory system
3. **Retrieves** relevant preferences using dense retrieval + reranking
4. **Adapts** responses using a learned user vector (REINFORCE-based RL)
5. **Improves** over multiple sessions without explicit user configuration
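The extract → store → retrieve loop can be sketched end-to-end with toy components. The keyword extractor and token-overlap scorer below are illustrative stand-ins for the LLM-based extractor and dense retrieval used in the real system; none of these names are the repository's actual API:

```python
# Toy sketch of the extract -> store -> retrieve loop. Keyword matching
# stands in for LLM preference extraction; token overlap stands in for
# dense retrieval + reranking.

def extract_preferences(user_turn: str) -> list[str]:
    """Rough heuristic: treat 'I prefer ...' / 'show me ...' clauses as preferences."""
    prefs = []
    lowered = user_turn.lower()
    for marker in ("i prefer ", "show me "):
        if marker in lowered:
            prefs.append(lowered.split(marker, 1)[1].rstrip(". "))
    return prefs

def retrieve(query: str, memory: list[str], k: int = 1) -> list[str]:
    """Score stored memories by token overlap with the query (retrieval stand-in)."""
    q_tokens = set(query.lower().split())
    scored = sorted(memory, key=lambda m: len(q_tokens & set(m.split())), reverse=True)
    return scored[:k]

memory: list[str] = []
memory += extract_preferences("I prefer bullet points for summaries.")
memory += extract_preferences("Show me step-by-step math.")

print(retrieve("solve this math problem step by step", memory))
# -> ['step-by-step math']
```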
### Key Innovation
Unlike systems built on static preference profiles, this one uses:
- **Online preference extraction** from natural conversation
- **Policy-based memory retrieval** with user-specific vectors
- **REINFORCE updates** from implicit user feedback (enforcement signals)
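One way to picture the REINFORCE update on the user vector (a minimal sketch, assuming a softmax retrieval policy over memory embeddings; the actual update in `src/personalization/user_model/` may differ):

```python
import math

# Minimal REINFORCE sketch: a softmax policy over memory embeddings,
# parameterized by a per-user vector u. Selection probability is
# pi(i) = softmax(u . e_i); after observing reward r for a selected
# memory i, u moves along r * grad log pi(i) = r * (e_i - sum_j pi(j) e_j).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(u, embeddings, chosen, reward, lr=0.1):
    """Return the updated user vector after one REINFORCE step."""
    probs = softmax([dot(u, e) for e in embeddings])
    # Expected embedding under the current policy.
    expected = [sum(p * e[d] for p, e in zip(probs, embeddings))
                for d in range(len(u))]
    grad = [embeddings[chosen][d] - expected[d] for d in range(len(u))]
    return [u[d] + lr * reward * grad[d] for d in range(len(u))]

u = [0.0, 0.0]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
u = reinforce_step(u, embeddings, chosen=0, reward=1.0)
probs = softmax([dot(u, e) for e in embeddings])
# A positive reward should raise the chosen memory's selection probability.
assert probs[0] > 0.5
```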
---
## Repository Structure
```
personalization-user-model/
├── src/personalization/ # Core personalization library
│ ├── serving/ # Main PersonalizedLLM class
│ │ └── personalized_llm.py # Primary inference interface
│ ├── models/ # LLM backends (vLLM, local)
│ │ └── llm/vllm_chat.py # vLLM HTTP client
│ ├── retrieval/ # Dense retrieval + reranking
│ ├── feedback/ # REINFORCE reward processing
│ ├── user_model/ # User vector learning
│ └── evaluation/ # Metrics and analysis
│
├── collaborativeagents/ # Experiment framework (MULTISESSIONCOLLAB)
│ ├── adapters/ # Method adapters for experiments
│ │ ├── personalized_llm_adapter.py # RAG methods
│ │ ├── contextual_adapter.py # Full-context baseline
│ │ └── reflection_adapter.py # CollaborativeAgents baseline
│ ├── agents/ # Batch processing clients
│ │ └── batch_vllm_agent.py # Async vLLM/OpenAI batching
│ ├── scripts/ # Experiment runners
│ │ └── run_experiments.py # Main experiment script
│ ├── slurm/ # HPC job scripts
│ │ └── fullscale/ # Full-scale experiment jobs
│ ├── data/ # User profiles and datasets
│ │ └── complex_profiles_v2/ # 200 user profiles (43 prefs each)
│ └── results/ # Experiment outputs
│
├── models/ # Downloaded HuggingFace models (not in git)
│ ├── llama-3.1-8b-instruct/ # Agent LLM
│ ├── qwen3-embedding-8b/ # Dense embeddings
│ └── rerankers/ # Qwen3 8B reranker
│
├── data/ # Datasets and corpora (not in git)
│ ├── corpora/ # Memory card storage
│ └── users/ # User state persistence
│
└── LLaMA-Factory/ # External fine-tuning toolkit
```
---
## Methods (Baselines)
The experiment compares **6 methods** for multi-session personalization:
| Method | Description | Memory Type |
|--------|-------------|-------------|
| `vanilla` | No memory, pure LLM | None |
| `contextual` | Full conversation history in context | In-context |
| `reflection` | Session-level reflection → agent_notes | Summarized notes |
| `all_memory` | All extracted preferences in context | All memories |
| `rag` | Dense retrieval + reranking (no user vector) | Retrieved top-k |
| `rag_vector` | RAG + learned user vector (proposed) | Retrieved + personalized |
---
## Setup
### 1. Environment
```bash
cd /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model
source /u/yurenh2/miniforge3/etc/profile.d/conda.sh
conda activate eval
export PYTHONPATH="${PWD}/src:${PWD}/collaborativeagents:${PYTHONPATH}"
export HF_HOME=/projects/bfqt/users/yurenh2/hf_cache/huggingface
```
### 2. Environment Variables
Create `.env` file with:
```bash
OPENAI_API_KEY=sk-... # For GPT user simulator
HF_TOKEN=hf_... # For gated models
```
### 3. Models Required
- **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
- **User Simulator**: 70B model via vLLM or OpenAI API (gpt-5-mini)
- **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
- **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)
### 4. Storage Locations
| Path | Quota | Usage |
|------|-------|-------|
| `/projects/bfqt` | 500GB (soft) | Code, models, results |
| `/work/hdd/bfqt` | 1TB | Overflow storage, large checkpoints |
| `/work/nvme/bfqt` | 500GB | Fast scratch (temporary) |
---
## Running Experiments
### Quick Test (2 profiles, 2 sessions)
```bash
cd collaborativeagents/scripts
python run_experiments.py \
--methods vanilla \
--datasets math-hard \
--n-profiles 2 \
--n-sessions 2 \
--max-turns 8 \
--use-vllm \
--vllm-agent-url http://localhost:8003/v1 \
--parallel-profiles 2 \
--profile-path ../data/complex_profiles_v2/profiles_200.jsonl \
--output-dir ../results/test
```
### Full-Scale Experiment
```bash
# GPU Layout (4x A100 80GB):
# GPU 0-1: 70B user simulator (TP=2)
# GPU 2: 8B agent
# GPU 3: Embedding + Reranker
cd collaborativeagents/slurm/fullscale
sbatch test_local_user.sh
```
### Key Arguments
| Argument | Description |
|----------|-------------|
| `--methods` | Comma-separated: vanilla,contextual,reflection,all_memory,rag,rag_vector |
| `--n-profiles` | Number of user profiles (max 200) |
| `--n-sessions` | Sessions per profile |
| `--max-turns` | Max turns per session |
| `--use-vllm` | Use vLLM for agent (required for batching) |
| `--use-openai-user` | Use OpenAI API for user simulator |
| `--vllm-user-url` | Local vLLM user simulator URL |
| `--parallel-profiles` | Batch size for turn-synchronous processing |
| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |
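Turn-synchronous batching with `--parallel-profiles` works roughly like this (an illustrative asyncio sketch with a stubbed model call, not the actual `BatchVLLMClient` implementation): at each turn, the prompts for all active profiles are fired concurrently as one batch, so the vLLM server can process them together.

```python
import asyncio

# Sketch of turn-synchronous batching: every turn gathers one prompt per
# active profile into a single batch of concurrent requests. call_model
# is a stub standing in for the HTTP round trip to vLLM.

async def call_model(prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for network latency
    return f"response to: {prompt}"

async def run_turn(prompts: list[str]) -> list[str]:
    """Fire all prompts for this turn concurrently and wait for all replies."""
    return await asyncio.gather(*(call_model(p) for p in prompts))

async def run_sessions(n_profiles: int, max_turns: int) -> list[list[str]]:
    histories = [[] for _ in range(n_profiles)]
    for turn in range(max_turns):
        prompts = [f"profile {i}, turn {turn}" for i in range(n_profiles)]
        replies = await run_turn(prompts)  # one batched step per turn
        for hist, reply in zip(histories, replies):
            hist.append(reply)
    return histories

histories = asyncio.run(run_sessions(n_profiles=2, max_turns=3))
print(len(histories), len(histories[0]))
# -> 2 3
```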
---
## Current Results
### Completed Experiments
Located in `collaborativeagents/results/`:
| Experiment | Profiles | Sessions | Methods | Status |
|------------|----------|----------|---------|--------|
| `rag_vector_v2_*` | 10 | 10 | rag_vector | Complete |
| `gpt_user_all_methods_*` | 5-10 | 2-5 | all 6 | Partial |
| `test_50parallel_*` | 50 | 1 | vanilla | Test only |
### Throughput Benchmarks
| Setup | Throughput | Notes |
|-------|------------|-------|
| OpenAI user + vLLM agent | ~60 sessions/hr | API latency bottleneck |
| Local 70B user + 8B agent | ~2000+ sessions/hr | Expected (not yet tested) |
---
## Experiments To Be Done
### 1. Full-Scale Benchmark (Priority: HIGH)
**Goal**: 200 profiles × 6 methods × 15 sessions = 18,000 sessions
**Setup**:
- User simulator: Local 70B (vLLM, TP=2) - NOT OpenAI (too slow)
- Agent: 8B LLaMA (vLLM)
- Reward: LLM judge (GPT-5-nano via API)
**GPU Layout** (4x A100 80GB):
```
GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
GPU 2: 8B agent
GPU 3: Embedding + Reranker (for RAG methods)
```
**Jobs**: Split by method and profile range (50 profiles each)
```
collaborativeagents/slurm/fullscale/
├── run_vanilla_p{0,50,100,150}.sh
├── run_contextual_p{0,50,100,150}.sh
├── run_reflection_p{0,50,100,150}.sh
├── run_all_memory_p{0,50,100,150}.sh
├── run_rag_p{0,50,100,150}.sh
├── run_rag_vector_p{0,50,100,150}.sh
└── submit_all.sh
```
### 2. Session Extension (Priority: MEDIUM)
If 15 sessions prove insufficient, continue to 30 sessions from a checkpoint:
```bash
python run_experiments.py \
--n-sessions 30 \
--continue-from ../results/fullscale_15sess/...
```
### 3. Ablation Studies (Priority: LOW)
- RAG with BGE reranker (278M) vs Qwen3 (8B)
- Best-of-N sampling (N=3) for RAG methods
- Different embedding models
---
## Key Files Reference
### Core Personalization
| File | Purpose |
|------|---------|
| `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
| `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
| `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |
### Experiment Framework
| File | Purpose |
|------|---------|
| `collaborativeagents/scripts/run_experiments.py` | Main experiment runner with batch processing |
| `collaborativeagents/adapters/*.py` | Method-specific adapters with `prepare_prompt()`, `process_response()` |
| `collaborativeagents/agents/batch_vllm_agent.py` | `BatchVLLMClient` and `BatchOpenAIClient` for async batching |
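The adapter contract can be pictured as a two-method interface (a hypothetical minimal base class; the real adapters in `collaborativeagents/adapters/` carry more state, such as retrieval indices and session history):

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the two-method adapter contract used for batched
# vLLM calls: prepare_prompt() builds the request before the batch is
# sent, process_response() consumes the reply afterwards.

class MethodAdapter(ABC):
    @abstractmethod
    def prepare_prompt(self, user_message: str) -> str: ...

    @abstractmethod
    def process_response(self, response: str) -> str: ...

class VanillaAdapter(MethodAdapter):
    """No memory: pass the user message through unchanged."""
    def prepare_prompt(self, user_message: str) -> str:
        return user_message

    def process_response(self, response: str) -> str:
        return response

class AllMemoryAdapter(MethodAdapter):
    """Prepend every stored preference to the prompt."""
    def __init__(self, memories: list[str]):
        self.memories = memories

    def prepare_prompt(self, user_message: str) -> str:
        prefs = "\n".join(f"- {m}" for m in self.memories)
        return f"User preferences:\n{prefs}\n\n{user_message}"

    def process_response(self, response: str) -> str:
        return response

adapter = AllMemoryAdapter(["prefers bullet points"])
prompt = adapter.prepare_prompt("Summarize this article.")
print(prompt.splitlines()[1])
# -> - prefers bullet points
```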
### Data
| File | Purpose |
|------|---------|
| `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl` | 200 user profiles with 43 preferences each |
| `data/corpora/empty_store/` | Empty memory store for fresh experiments |
---
## Troubleshooting
### Quota Exceeded
```bash
# Check quota
quota -s
# Move large files to HDD storage
mv /projects/bfqt/users/yurenh2/large_dir /work/hdd/bfqt/users/yurenh2/
```
### vLLM Server Issues
```bash
# Check if server is running
curl http://localhost:8003/health
# Kill existing servers
pkill -f "vllm.entrypoints"
```
### Out of GPU Memory
- Reduce `--gpu-memory-utilization` (default 0.90)
- Reduce `--max-model-len` (default 8192)
- Use quantized models (AWQ INT4)
### Slow Throughput
- Use local vLLM user simulator instead of OpenAI API
- Increase `--parallel-profiles` for better batching
- Check vLLM logs for "Running: N reqs" to verify batching
---
## Code Conventions
1. **Batch Processing**: All adapters must implement `prepare_prompt()` and `process_response()` for batched vLLM calls
2. **Device Assignment**: GPUs 0-1 for large models, GPU 2 for agent, GPU 3 for embedding/reranker
3. **Checkpoints**: Session-level tracking in `checkpoint.json` with `sessions_per_profile` dict
4. **Results**: JSON format in `results.json` with metrics per session
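A plausible shape for `checkpoint.json` (illustrative only; every field name here other than `sessions_per_profile` is an assumption, not the runner's actual schema):

```json
{
  "experiment": "fullscale_15sess",
  "method": "rag_vector",
  "sessions_per_profile": {
    "profile_000": 15,
    "profile_001": 12
  }
}
```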
---
## Contact
For questions about this codebase, refer to the experiment plan at:
`/u/yurenh2/.claude/plans/effervescent-mapping-ocean.md`