# Internal Project Notes: Personalization Benchmark

> **Purpose**: Internal documentation for AI assistants and developers to understand project state, design decisions, and next steps.

---

## 1. Project Overview

### What This Project Is
A benchmark to evaluate **retrieval-based personalization** (user's main project) against various baselines using a **user simulator framework** (adapted from CollaborativeAgents/MULTISESSIONCOLLAB).

### Core Hypothesis
**Extractor + RAG + User Vector** outperforms all baselines because:
1. **Scalability**: RAG retrieves only relevant preferences (doesn't overflow context)
2. **Conflict Resolution**: User vector learns which preferences apply in which situations
3. **Avoids Over-personalization**: Doesn't apply irrelevant preferences

### User's Main Project Location
`/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/src/personalization/`

Key components:
- `serving/personalized_llm.py` - Main interface with `chat()`, `apply_feedback()`, `reset_session()`
- `retrieval/pipeline.py` - Dense retrieval + reranking + policy-based selection
- `user_model/tensor_store.py` - Dual vectors: z_long (persistent), z_short (session)
- `feedback/policy/reinforce.py` - REINFORCE updates from inferred rewards
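
The benchmark adapter only needs these three methods; a minimal `Protocol` sketch of the assumed interface (the signatures are guesses, not confirmed against `serving/personalized_llm.py`), plus a trivial stub for smoke-testing the benchmark plumbing:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PersonalizedLLMLike(Protocol):
    """Assumed interface of PersonalizedLLM; signatures are illustrative."""

    def chat(self, message: str) -> str: ...
    def apply_feedback(self, feedback: str) -> None: ...
    def reset_session(self) -> None: ...


class EchoStub:
    """Stand-in implementation for testing adapters without the real system."""

    def chat(self, message: str) -> str:
        return f"echo: {message}"

    def apply_feedback(self, feedback: str) -> None:
        pass  # the real system would run a REINFORCE update here

    def reset_session(self) -> None:
        pass  # the real system would clear z_short here
```

Wiring the adapter against the `Protocol` rather than the concrete class keeps the benchmark testable before the end-to-end integration is done.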

---

## 2. Experiment Design

### Baselines (7 methods)

| Method | Description | Expected Weakness |
|--------|-------------|-------------------|
| `vanilla` | No memory | Forgets everything |
| `contextual` | Full history in context, summarize at limit | Summary loses specific preferences |
| `reflection` | CollaborativeAgents' agent_notes | Notes become unwieldy |
| `reflection_grpo` | Reflection + GRPO training | Better but no selective retrieval |
| `all_memory` | All extracted memories in context | Over-personalization, conflict confusion |
| `rag` | Extractor + RAG (no user vector) | Good retrieval but random on conflicts |
| `rag_vector` | Extractor + RAG + user vector | **Proposed best method** |

### Datasets

**Existing (from CollaborativeAgents):**
- math-500, math-hard, humaneval, bigcodebench, logiqa, mmlu, medqa

**New challenging datasets (added):**
- `gpqa` - PhD-level science (extremely hard)
- `theoremqa` - Theorem-based math proofs
- `aime` - Competition math (answers 0-999)
- `livecodebench` - Recent competitive programming
- `scicode` - Scientific computing

### Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| Task Success | Did agent solve the problem? | LLM judge |
| User Effort | User's token count + enforcement count | Automatic |
| Efficiency | Total tokens used | Automatic |
| Conflict Resolution Accuracy | Picked correct preference in conflicts? | LLM judge |
| Over-personalization Rate | Applied irrelevant preferences? | LLM judge |
| Preference Compliance | Followed applicable preferences? | LLM judge |
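
The three judge-based preference metrics reduce to simple aggregation over per-turn judge labels; a hedged sketch (the record fields are illustrative, not the actual judge output schema in `evaluation/llm_judge.py`):

```python
def aggregate_judgments(records):
    """Aggregate per-turn LLM-judge labels into benchmark metrics.

    Each record is a dict with illustrative boolean fields:
      is_conflict_turn, picked_correct_pref, applied_irrelevant_pref, complied
    """
    conflicts = [r for r in records if r["is_conflict_turn"]]
    return {
        "conflict_resolution_accuracy": (
            sum(r["picked_correct_pref"] for r in conflicts) / len(conflicts)
            if conflicts else None
        ),
        "over_personalization_rate": (
            sum(r["applied_irrelevant_pref"] for r in records) / len(records)
        ),
        "preference_compliance": (
            sum(r["complied"] for r in records) / len(records)
        ),
    }
```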

### User Profiles

**Key change from original:**
- Original: 3 flat preferences per user (too simple)
- Ours: **~40 conditional preferences** with explicit conditions

Example:
```json
{
  "condition": "debugging code",
  "preference": "show full stack traces and line numbers",
  "conflict_group": "code_detail"
}
```

**Conflict groups** (15 types):
- format_structure (bullets vs numbered)
- verbosity_level (concise vs detailed)
- code_naming (snake_case vs camelCase by language)
- math_detail (derivations vs final answer)
- guidance_style (incremental vs holistic)
- etc.
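
A profile is then a list of such entries, and potential conflicts fall out of bucketing by `conflict_group`; a small sketch (field names follow the JSON example above; the helper itself is hypothetical):

```python
from collections import defaultdict


def group_conflicts(preferences):
    """Bucket conditional preferences by conflict_group.

    Any group with 2+ entries is a potential conflict a query can trigger.
    """
    groups = defaultdict(list)
    for pref in preferences:
        groups[pref["conflict_group"]].append(pref)
    return {g: prefs for g, prefs in groups.items() if len(prefs) >= 2}


# Illustrative 3-preference profile (real profiles have ~40 entries)
profile = [
    {"condition": "debugging code", "preference": "show full stack traces", "conflict_group": "code_detail"},
    {"condition": "quick questions", "preference": "one-line answers only", "conflict_group": "verbosity_level"},
    {"condition": "learning a new topic", "preference": "explain thoroughly", "conflict_group": "verbosity_level"},
]
```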

### Conflict Testing Design

**Critical for proving the hypothesis.** Every conflict query:
1. Triggers 2+ conflicting preferences
2. Has exactly ONE correct preference given the context
3. Is phrased so query-condition similarity favors the correct preference (so RAG retrieves it)
4. Exposes context-based methods to ALL candidates at once, where they tend to get confused

Example:
```
Query: "Quick question - how does backpropagation work?"

Triggered preferences:
  ✓ "When I say 'quick', be concise" (matches query signal)
  ✗ "For complex ML topics, explain in detail" (lower relevance)

RAG retrieves: concise preference (CORRECT)
Context methods see: BOTH → confused → often wrong
```
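
The retrieval step in that example can be sketched with toy keyword overlap standing in for dense retrieval (the real pipeline in `retrieval/pipeline.py` uses dense embeddings plus reranking; this is only an illustration of why the "quick" signal wins):

```python
def score(query: str, condition: str) -> int:
    """Toy relevance: count of condition words that appear in the query."""
    query_words = set(query.lower().split())
    return sum(word in query_words for word in condition.lower().split())


def retrieve(query: str, prefs):
    """Return the single preference whose condition best matches the query."""
    return max(prefs, key=lambda p: score(query, p["condition"]))


prefs = [
    {"condition": "user says quick", "preference": "be concise"},
    {"condition": "complex ML theory topics", "preference": "explain in detail"},
]
```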

---

## 3. Technical Architecture

### Models
- **User Simulator**: Llama-3.3-70B-Instruct (powerful, simulates realistic user)
- **Agent**: Llama-3.1-8B-Instruct (smaller, realistic deployment)
- **LLM Judge**: Llama-3.3-70B-Instruct (evaluates quality)

### Compute Resources (from sinfo)
- `gpu` partition: 4x A100 80GB
- `h100` partition: 4x H100 80GB
- Good for running 70B models

### Session Structure
- 100 user profiles
- 20 sessions per profile
- 15 max turns per session
- 30% of queries are conflict tests
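
Back-of-envelope scale for these numbers, assuming one user query per turn and every session running to the 15-turn cap (real sessions may end earlier, so these are upper bounds):

```python
profiles = 100
sessions_per_profile = 20
max_turns = 15
conflict_frac = 0.30

total_sessions = profiles * sessions_per_profile           # 2,000 sessions per method
max_queries = total_sessions * max_turns                   # up to 30,000 queries
max_conflict_queries = int(max_queries * conflict_frac)    # up to 9,000 conflict tests

print(total_sessions, max_queries, max_conflict_queries)
```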

---

## 4. Files Created

### Core Experiment Framework

| File | Status | Purpose |
|------|--------|---------|
| `scripts/run_experiments.py` | DONE | Main orchestrator - runs all methods, evaluates, generates report |
| `scripts/run_baseline_comparison.py` | DONE | Baseline comparison runner |
| `scripts/conflict_scenario_generator.py` | DONE | Generates 70+ conflict templates |
| `evaluation/llm_judge.py` | DONE | LLM-as-judge for all metrics |
| `evaluation/__init__.py` | DONE | Module exports |

### Profile Generation

| File | Status | Purpose |
|------|--------|---------|
| `data/preference_schema_v2_sample.json` | DONE | Sample schema with 40 preferences, 15 conflict groups |
| `scripts/generate_complex_profiles.py` | DONE | LLM-based batch generation |
| `scripts/generate_profiles_v2.py` | DONE | Profile generation with quality control |

### Dataset & Prompts

| File | Status | Purpose |
|------|--------|---------|
| `datasets_extended.py` | DONE | New challenging datasets (GPQA, AIME, etc.) |
| `prompts_extended.py` | DONE | Step-by-step prompts to make sessions longer |

### Integration

| File | Status | Purpose |
|------|--------|---------|
| `adapters/personalized_llm_adapter.py` | DONE | Wraps PersonalizedLLM for benchmark |

### Documentation

| File | Status | Purpose |
|------|--------|---------|
| `EXPERIMENT_DESIGN.md` | DONE | Human-readable design doc |
| `INTERNAL_NOTES.md` | DONE | This file - AI/developer notes |

---

## 5. What's TBD (To Be Done)

### High Priority

1. **Generate 100 User Profiles**
   - Run `scripts/generate_profiles_v2.py` with LLM
   - Need SLURM script for cluster
   - Output: `data/complex_profiles_100.json`

2. **End-to-End Integration Test**
   - Verify PersonalizedLLM adapter works with conversation generator
   - Test with 2-3 profiles, 1-2 sessions each
   - Fix any import/path issues

3. **SLURM Scripts for Cluster**
   - Profile generation job
   - Main experiment job (may need multi-GPU)

### Medium Priority

4. **Baseline Adapter Implementations**
   - `vanilla` - simple, no memory
   - `contextual` - needs summarization logic
   - `reflection` - use existing CollaborativeAgents code
   - `reflection_grpo` - need trained model
   - `all_memory` - dump all memories in context

5. **User Simulator Integration**
   - Ensure the user agent enforces preferences (restates them when violated)
   - Ensure the user agent expresses disappointment when preferences are ignored
   - Both should count as negative reward signals
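
The `contextual` baseline's summarize-at-limit logic could be sketched as follows (the word budget, the summarizer hook, and the class itself are placeholders, not existing code):

```python
class ContextualHistory:
    """Keep full history until a word budget is hit, then fold the oldest half into a summary."""

    def __init__(self, summarize, budget_words: int = 4000):
        self.summarize = summarize  # in practice an LLM call; here any str -> str function
        self.budget = budget_words
        self.messages: list[str] = []

    def add(self, message: str) -> None:
        self.messages.append(message)
        if sum(len(m.split()) for m in self.messages) > self.budget:
            half = len(self.messages) // 2
            summary = self.summarize("\n".join(self.messages[:half]))
            # Summarization is exactly where specific preferences get lost,
            # which is this baseline's expected weakness.
            self.messages = [f"[summary] {summary}"] + self.messages[half:]
```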

### Lower Priority

6. **User Vector Similarity Analysis**
   - Create ground truth vectors from preference categories
   - Compare learned z_long to ground truth
   - Would show that learned user vectors capture identity

7. **Visualization & Reporting**
   - Token usage graphs over sessions
   - Conflict resolution accuracy by type
   - User effort reduction curves
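
The similarity check in item 6 is plain cosine similarity between the learned `z_long` and a ground-truth category vector (pure-Python sketch; constructing the ground-truth vector from preference categories is the open part):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```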

---

## 6. Key Design Decisions (Reference)

### Why 40 Preferences?
- The original 3 preferences are too simple - any method can handle them
- 40 creates realistic complexity
- Forces context-based methods to overflow or summarize (losing info)
- Creates many conflict opportunities

### Why Conditional Preferences?
- Real users have context-dependent preferences
- "Be concise WHEN I'm rushed" vs "Explain thoroughly WHEN I'm learning"
- RAG naturally handles this via query similarity
- Context methods struggle with conflicting conditions

### Why Step-by-Step Prompts?
- Makes sessions longer (more turns)
- More opportunities for preference expression/violation
- Generates more context (stresses context-based methods)
- More realistic for complex problems

### Why LLM-as-Judge?
- Preference compliance is subjective
- Can't programmatically check "was this concise enough?"
- LLM judge provides nuanced evaluation
- Using 70B model for quality
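
A minimal judge-prompt sketch for the compliance metric (wording and rubric are illustrative; the real prompts live in `evaluation/llm_judge.py`):

```python
def build_compliance_prompt(preference: str, response: str) -> str:
    """Format a yes/no compliance question for the 70B judge model."""
    return (
        "You are grading an assistant's reply against a user preference.\n"
        f"Preference: {preference}\n"
        f"Reply: {response}\n"
        "Answer YES if the reply follows the preference, otherwise NO."
    )
```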

---

## 7. Code Locations Reference

### User's Personalization System
```
src/personalization/
├── serving/personalized_llm.py      # Main API
├── retrieval/pipeline.py            # RAG pipeline
├── user_model/tensor_store.py       # User vectors
├── feedback/policy/reinforce.py     # Online learning
├── models/preference_extractor/     # Extraction
└── configs/                         # Config files
```

### CollaborativeAgents (Original)
```
collaborativeagents/collaborativeagents/
├── agents/user_agent.py             # User simulator
├── agents/collaborator_agent.py     # Agent interface
├── conversation_generator.py        # Conversation loop
├── conversation_evaluator.py        # Evaluation
├── prompts.py                       # Original prompts
└── datasets/                        # Original datasets
```

### Our Extensions
```
collaborativeagents/
├── scripts/
│   ├── run_experiments.py           # Main entry point
│   ├── generate_profiles_v2.py      # Profile generation
│   └── conflict_scenario_generator.py
├── evaluation/
│   └── llm_judge.py                 # LLM evaluation
├── adapters/
│   └── personalized_llm_adapter.py  # Integration
├── datasets_extended.py             # New datasets
├── prompts_extended.py              # Step-by-step prompts
└── data/
    └── preference_schema_v2_sample.json
```

---

## 8. Quick Start Commands (TBD - update after testing)

```bash
# Generate profiles (need SLURM script)
python scripts/generate_profiles_v2.py --n-profiles 100 --output data/profiles.json

# Run experiments
python scripts/run_experiments.py \
  --methods vanilla,contextual,rag,rag_vector \
  --datasets gpqa,aime \
  --n-profiles 100 \
  --n-sessions 20 \
  --output-dir results/

# Quick test (2 profiles, 2 sessions)
python scripts/run_experiments.py \
  --methods rag_vector \
  --datasets math-500 \
  --n-profiles 2 \
  --n-sessions 2 \
  --output-dir results/test/
```

---

## 9. Dataset & Profile Generation Details

### Datasets
**Status**: NOT pre-downloaded. Uses HuggingFace `datasets` library.
- Will be downloaded on first use via `load_dataset()`
- Cached locally after first download
- Some datasets may require authentication (GPQA requires HuggingFace login)

**To pre-download** (optional):
```python
from datasets import load_dataset

datasets_to_download = [
    ("HuggingFaceH4/MATH-500", None),
    ("lighteval/MATH-Hard", None),
    ("openai/openai_humaneval", None),
    ("bigcode/bigcodebench", None),
    ("Idavidrein/gpqa", "gpqa_diamond"),  # gated: needs `huggingface-cli login`
    ("TIGER-Lab/TheoremQA", None),
    # etc.
]
for name, config in datasets_to_download:
    try:
        load_dataset(name, config)  # cached under ~/.cache/huggingface after first run
    except Exception as exc:
        print(f"Skipping {name}: {exc}")  # e.g. missing token for a gated dataset
```

### Profile Generation
**LLM Used**: `meta-llama/Llama-3.1-70B-Instruct` via `litellm`
- Requires litellm installed: `pip install litellm`
- Needs API endpoint or local model setup
- Fallback: `generate_from_schema()` function creates profiles from predefined schema (no LLM needed)

**Generation command**:
```bash
# With LLM
python scripts/generate_profiles_v2.py --n-profiles 100 --model meta-llama/Llama-3.1-70B-Instruct

# Without LLM (from schema)
python scripts/generate_profiles_v2.py --n-profiles 100 --from-schema data/preference_schema_v2_sample.json
```
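
The no-LLM fallback amounts to sampling entries from the schema; a hypothetical sketch of what `generate_from_schema()` might do (the actual function lives in `generate_profiles_v2.py` and may differ):

```python
import random


def sample_profile(schema_prefs, n_prefs: int = 40, seed: int = 0):
    """Sample a profile of conditional preferences from a schema list, without replacement."""
    rng = random.Random(seed)  # seeded so profile generation is reproducible
    n = min(n_prefs, len(schema_prefs))
    return rng.sample(schema_prefs, n)
```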

---

## 10. Notes for Future Sessions

- User prefers 100 profiles but is flexible based on compute
- User wants all new challenging datasets included
- User wants LLM as judge (not programmatic)
- User LLM should be Llama-70B, Agent should be Llama-8B
- PDF in `collaborativeagents/` folder describes original experiments
- User's PersonalizedLLM has easy-to-use `chat()` function - prefer using it

---

*Last updated: Session where datasets_extended.py and llm_judge.py were created*