# Experiment Design: RAG-based Personalization vs Baselines
## Goal
Demonstrate that **Extractor + RAG + User Vector** outperforms all baselines, especially on:
1. **Long-context scenarios** where context-based methods overflow/degrade
2. **Conflict resolution** where RAG naturally retrieves the correct preference
3. **User modeling quality** where user vectors capture true user identity
---
## Baselines
| # | Method | Memory Type | Retrieval | User Model |
|---|--------|-------------|-----------|------------|
| 1 | Vanilla | None | - | - |
| 2 | Contextual Memory | Full history | Recency | - |
| 3 | Reflection Memory | Agent notes | - | - |
| 4 | Reflection + GRPO | Trained notes | - | - |
| 5 | All Memory Cards | Extracted prefs | None (all in context) | - |
| 6 | Extractor + RAG | Extracted prefs | Dense + Rerank | - |
| 7 | **Extractor + RAG + User Vector** | Extracted prefs | Policy-based | z_long, z_short |
---
## User Profile Design
### Complexity: 40 Conditional Preferences
Each preference has:
```json
{
"pref_id": "cs_001",
"condition": "writing Python code",
"action": "use snake_case, include type hints",
"conflict_group": "naming_convention",
"priority_context": ["python", ".py", "def"]
}
```
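The schema above can be mirrored as a small Python dataclass so that conflict-group lookups are explicit. This is a minimal sketch: the field names follow the sample JSON, and `group_conflicts` is a hypothetical helper, not part of the existing scripts.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Preference:
    pref_id: str
    condition: str           # when the preference applies
    action: str              # what the agent should do
    conflict_group: str      # prefs sharing a group may contradict each other
    priority_context: list   # cue tokens that signal this pref is relevant

def group_conflicts(prefs):
    """Index preferences by conflict group so conflicting pairs are easy to enumerate."""
    groups = defaultdict(list)
    for p in prefs:
        groups[p.conflict_group].append(p)
    return groups

pref = Preference(
    pref_id="cs_001",
    condition="writing Python code",
    action="use snake_case, include type hints",
    conflict_group="naming_convention",
    priority_context=["python", ".py", "def"],
)
```

Grouping by `conflict_group` is what makes the 15 conflict groups below testable: any group with two or more members yields a candidate conflict scenario.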
### 8 Categories
- Response Format (4 prefs)
- Verbosity (5 prefs)
- Code Style (5 prefs)
- Math Style (6 prefs)
- Interaction Pattern (6 prefs)
- Domain-Specific (6 prefs)
- Error Correction (4 prefs)
- Output Artifacts (4 prefs)
### 15 Conflict Groups
Pairs of preferences that could BOTH apply to a query, but only ONE is correct given context.
---
## Conflict Scenario Design
### Key Insight
RAG naturally resolves conflicts because it retrieves based on **query similarity**:
- Query: "Quick question - explain backpropagation"
- Pref A: "When I say 'quick', be concise" (HIGH similarity to "quick question")
- Pref B: "For complex ML topics, explain in detail" (lower similarity)
- RAG retrieves Pref A → correct!
- Context methods see BOTH → confused, might give long response
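The retrieval intuition above can be sketched by scoring each preference on the overlap between the query and its `priority_context` cues. This is a toy stand-in for dense retrieval + reranking, not the actual retriever, and the cue lists below are assumptions:

```python
def retrieve(query: str, prefs: dict, k: int = 1) -> list:
    """Rank preferences by how many of their cue tokens appear in the query.
    A crude proxy for embedding similarity, used only to illustrate why the
    brevity cue in 'quick question' pulls in the concise-answer pref."""
    q_tokens = set(query.lower().split())
    scored = sorted(prefs, key=lambda p: -len(q_tokens & set(prefs[p])))
    return scored[:k]

# Hypothetical cue lists for the two conflicting preferences in the example.
prefs = {
    "pref_A_concise": ["quick", "brief", "short"],
    "pref_B_detailed": ["derivation", "proof", "in-depth"],
}
top = retrieve("quick question - explain backpropagation", prefs)
# pref_A_concise gets one cue hit ("quick"); pref_B_detailed gets none
```

A context-based method, by contrast, sees both preferences unconditionally; there is no scoring step at which the query can disambiguate them.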
### Conflict Categories
| Conflict Type | Example | Why RAG Wins |
|---------------|---------|--------------|
| Format (bullets vs numbered) | "Steps to deploy + list mistakes" | Retrieves format per section |
| Verbosity (concise vs detailed) | "Quick question - how does X work?" | Explicit brevity cue retrieved |
| Naming (snake vs camel) | "Write in Python and JavaScript" | Language detection per block |
| Answer position | "What is X and why?" | "What is" retrieves answer-first |
| Autonomy | "Refactor this" vs "Change line 5" | Scope detection |
| Example position | "How to use API" vs "What is recursion" | Practical vs theoretical |
### Why Context Methods Fail
1. **All prefs visible**: Sees both "be concise" AND "explain complex topics in detail"
2. **No ranking**: Can't determine which applies more to this query
3. **Tries to satisfy both**: Gives medium-length response satisfying neither
4. **Inconsistent**: Sometimes picks wrong one based on position in context
---
## Challenging Datasets
### Existing (keep)
- `math-500` (HuggingFaceH4/MATH-500)
- `humaneval` (code generation)
- `bigcodebench` (code completion)
### New (add for difficulty)
| Dataset | Domain | Why Challenging |
|---------|--------|-----------------|
| GPQA | PhD-level science | Complex multi-step reasoning |
| TheoremQA | Math proofs | Requires structured proof format |
| LiveCodeBench | Competitive programming | Recent, hard problems |
| AIME | Math competition | Olympiad-level difficulty |
| SciCode | Scientific computing | Domain expertise needed |
### Step-by-Step Encouragement
Add prompts that encourage longer, multi-turn problem solving:
```
"Work through this carefully:
1. Identify the problem type
2. State key concepts needed
3. Solve step by step
4. Verify your answer"
```
This creates more opportunities for preferences to be revealed and conflicts to occur.
---
## Session Structure
### Multi-Session Accumulation
```
Session 1: User reveals ~4 format/verbosity preferences
Session 2: User reveals ~4 code style preferences
Session 3: User reveals ~4 math preferences
...
Session 10: All 40 preferences established
```
### Per Session
- 3-5 problems per session
- 10-15 turns per problem (step-by-step solving)
- Total: ~50 turns per session → ~500 turns after 10 sessions
### Context Accumulation
- After 10 sessions: ~50,000+ tokens of history
- Contextual Memory baseline MUST summarize → loses specific preferences
- All Memory Cards baseline: 40 prefs × 50 tokens = 2000 tokens (manageable but conflicting)
- RAG: only retrieves ~5 relevant prefs per query
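The accumulation figures above imply the following rough per-query context budgets (all numbers are the estimates stated in this section, not measurements):

```python
# Back-of-envelope context budgets after 10 sessions (estimates only).
turns = 50 * 10                  # ~50 turns/session x 10 sessions
history_tokens = turns * 100     # ~100 tokens/turn -> full-history baseline
all_cards_tokens = 40 * 50       # 40 prefs x ~50 tokens each, all in context
rag_tokens = 5 * 50              # only ~5 retrieved prefs reach the prompt

print(history_tokens, all_cards_tokens, rag_tokens)  # 50000 2000 250
```

The 200x gap between full history and retrieved prefs is the core scalability argument for Hypothesis 1.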
---
## Metrics
### Primary Metrics (from their paper)
| Metric | Description | Lower/Higher Better |
|--------|-------------|---------------------|
| Task Accuracy | % of problems solved correctly | Higher |
| User Effort | User token count | Lower |
| Efficiency | Total tokens used | Lower |
### New Metrics (to prove our hypotheses)
| Metric | Description | What It Shows |
|--------|-------------|---------------|
| Conflict Resolution Accuracy | % of conflict scenarios resolved correctly | RAG's advantage |
| Preference Enforcement Rate | How often user had to correct agent | Quality of personalization |
| Context Degradation | Accuracy drop as sessions increase | Scalability |
| User Vector Similarity | Cosine sim between z_long and ground truth | User modeling quality |
| Cross-User Discrimination | Different users have different vectors | Vectors capture identity |
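The first two new metrics can be computed directly from logged episodes. A minimal sketch, assuming each episode record carries `is_conflict`, `applied_pref`, `gold_pref`, `num_turns`, and `num_user_corrections` fields (these field names are assumptions, not the runner's actual log format):

```python
def conflict_resolution_accuracy(episodes: list) -> float:
    """Fraction of conflict scenarios where the agent applied the correct pref."""
    conflicts = [e for e in episodes if e.get("is_conflict")]
    if not conflicts:
        return 0.0
    hits = sum(e["applied_pref"] == e["gold_pref"] for e in conflicts)
    return hits / len(conflicts)

def preference_enforcement_rate(episodes: list) -> float:
    """Fraction of turns where the user had to correct the agent (lower is better)."""
    turns = sum(e["num_turns"] for e in episodes)
    corrections = sum(e["num_user_corrections"] for e in episodes)
    return corrections / turns if turns else 0.0
```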
---
## Expected Results
### Hypothesis 1: RAG beats context on scalability
```
Session 1-3: All methods ~similar accuracy
Session 4-6: Contextual Memory starts degrading
Session 7-10: Contextual Memory significantly worse
RAG methods: Consistent accuracy throughout
```
### Hypothesis 2: RAG beats all-in-context on conflicts
```
Conflict Resolution Accuracy:
- Vanilla: N/A (no personalization)
- Contextual Memory: ~40% (near-random when both prefs are visible)
- All Memory Cards: ~50% (sees all prefs, sometimes picks the right one)
- Extractor + RAG: ~70% (retrieves the relevant pref, but sometimes retrieves both)
- Extractor + RAG + User Vector: ~85% (vector learns which pref wins)
```
### Hypothesis 3: User vectors model users well
```
Intra-user similarity (same user, different sessions): >0.9
Inter-user similarity (different users): <0.5
Cluster quality: Users with similar preferences cluster together
```
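The intra/inter-user thresholds above reduce to plain cosine similarity over `z_long` vectors. A self-contained sketch with toy 3-d vectors (real vectors would come from the trained model; these values are illustrative only):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy z_long vectors: two sessions of user A, one session of user B.
user_a_s1 = [0.9, 0.1, 0.4]
user_a_s2 = [0.85, 0.15, 0.35]
user_b_s1 = [0.1, 0.9, 0.2]

intra = cosine(user_a_s1, user_a_s2)  # same user, different sessions
inter = cosine(user_a_s1, user_b_s1)  # different users
```

Hypothesis 3 predicts `intra > 0.9` and `inter < 0.5`; the cluster-quality check is the same computation run pairwise over all users, followed by any standard clustering metric.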
### Hypothesis 4: Lower user effort with better personalization
```
User Tokens (lower = better):
- Vanilla: 100% (baseline)
- Contextual Memory: 90%
- All Memory Cards: 85%
- Extractor + RAG: 70%
- Extractor + RAG + User Vector: 60%
```
---
## Experiment Execution
### Phase 1: Profile Generation
```bash
python scripts/generate_complex_profiles.py \
--num_profiles 50 \
--output_dir collaborativeagents/data/complex_profiles \
--model gpt-4o-mini
```
### Phase 2: Conflict Scenario Generation
```bash
python scripts/conflict_scenario_generator.py \
--profiles collaborativeagents/data/complex_profiles/profiles.jsonl \
--output collaborativeagents/data/conflict_scenarios.jsonl \
--scenarios_per_conflict 5
```
### Phase 3: Baseline Comparison
```bash
python scripts/run_baseline_comparison.py \
--baselines vanilla contextual_memory reflection_memory \
all_memory_cards extractor_rag extractor_rag_user_vector \
--dataset math-500 \
--num_sessions 10 \
--num_users 50 \
--output_dir collaborativeagents/results
```
### Phase 4: Analysis
- Degradation curves (accuracy vs session number)
- Conflict resolution accuracy by type
- User vector quality analysis
- Token efficiency comparison
---
## Files
| File | Purpose |
|------|---------|
| `data/preference_schema_v2_sample.json` | Sample preference schema |
| `scripts/generate_complex_profiles.py` | LLM-based profile generation |
| `scripts/conflict_scenario_generator.py` | Conflict test generation |
| `scripts/run_baseline_comparison.py` | Experiment runner |
| `results/` | Output directory |
---
## Next Steps
1. [ ] Generate 50 complex user profiles
2. [ ] Generate conflict scenarios for each profile
3. [ ] Integrate your PersonalizedLLM as the RAG + User Vector baseline
4. [ ] Implement adapter for CollaborativeAgents conversation flow
5. [ ] Run experiments on math-500 first (validate setup)
6. [ ] Add challenging datasets (GPQA, TheoremQA, etc.)
7. [ ] Run full comparison experiments
8. [ ] Generate analysis plots and tables