# Experiment Design: RAG-based Personalization vs Baselines

## Goal

Demonstrate that **Extractor + RAG + User Vector** outperforms all baselines, especially on:

1. **Long-context scenarios** where context-based methods overflow/degrade
2. **Conflict resolution** where RAG naturally retrieves the correct preference
3. **User modeling quality** where user vectors capture true user identity

---

## Baselines

| # | Method | Memory Type | Retrieval | User Model |
|---|--------|-------------|-----------|------------|
| 1 | Vanilla | None | - | - |
| 2 | Contextual Memory | Full history | Recency | - |
| 3 | Reflection Memory | Agent notes | - | - |
| 4 | Reflection + GRPO | Trained notes | - | - |
| 5 | All Memory Cards | Extracted prefs | None (all in context) | - |
| 6 | Extractor + RAG | Extracted prefs | Dense + Rerank | - |
| 7 | **Extractor + RAG + User Vector** | Extracted prefs | Policy-based | z_long, z_short |

---

## User Profile Design

### Complexity: 40 Conditional Preferences

Each preference has:

```json
{
  "pref_id": "cs_001",
  "condition": "writing Python code",
  "action": "use snake_case, include type hints",
  "conflict_group": "naming_convention",
  "priority_context": ["python", ".py", "def"]
}
```

### 8 Categories

- Response Format (4 prefs)
- Verbosity (5 prefs)
- Code Style (8 prefs)
- Math Style (5 prefs)
- Interaction Pattern (5 prefs)
- Domain-Specific (5 prefs)
- Error Correction (4 prefs)
- Output Artifacts (4 prefs)

### 15 Conflict Groups

Pairs of preferences that could BOTH apply to a query, but only ONE is correct given the context.

---

## Conflict Scenario Design

### Key Insight

RAG naturally resolves conflicts because it retrieves based on **query similarity**:

- Query: "Quick question - explain backpropagation"
- Pref A: "When I say 'quick', be concise" (HIGH similarity to "quick question")
- Pref B: "For complex ML topics, explain in detail" (lower similarity)
- RAG retrieves Pref A → correct!
- Context methods see BOTH → confused, might give a long response

### Conflict Categories

| Conflict Type | Example | Why RAG Wins |
|---------------|---------|--------------|
| Format (bullets vs numbered) | "Steps to deploy + list mistakes" | Retrieves format per section |
| Verbosity (concise vs detailed) | "Quick question - how does X work?" | Explicit brevity cue retrieved |
| Naming (snake vs camel) | "Write in Python and JavaScript" | Language detection per block |
| Answer position | "What is X and why?" | "What is" retrieves answer-first |
| Autonomy | "Refactor this" vs "Change line 5" | Scope detection |
| Example position | "How to use API" vs "What is recursion" | Practical vs theoretical |

### Why Context Methods Fail

1. **All prefs visible**: Sees both "be concise" AND "explain complex topics in detail"
2. **No ranking**: Can't determine which applies more to this query
3. **Tries to satisfy both**: Gives a medium-length response satisfying neither
4. **Inconsistent**: Sometimes picks the wrong one based on its position in context

---

## Challenging Datasets

### Existing (keep)

- `math-500` (HuggingFaceH4/MATH-500)
- `humaneval` (code generation)
- `bigcodebench` (code completion)

### New (add for difficulty)

| Dataset | Domain | Why Challenging |
|---------|--------|-----------------|
| GPQA | PhD-level science | Complex multi-step reasoning |
| TheoremQA | Math proofs | Requires structured proof format |
| LiveCodeBench | Competitive programming | Recent, hard problems |
| AIME | Math competition | Olympiad-level difficulty |
| SciCode | Scientific computing | Domain expertise needed |

### Step-by-Step Encouragement

Add prompts that encourage longer, multi-turn problem solving:

```
"Work through this carefully:
1. Identify the problem type
2. State key concepts needed
3. Solve step by step
4. Verify your answer"
```

This creates more opportunities for preferences to be revealed and conflicts to occur.
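The retrieval step behind the Key Insight above can be sketched with a toy similarity function. This is a minimal, self-contained stand-in for the dense embedder + reranker pipeline; `tokens`, `score`, `retrieve`, and the sample prefs are illustrative names, not the project's actual API.

```python
# Toy sketch of conflict resolution via retrieval: token-overlap
# similarity stands in for dense embeddings + reranking.
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query: str, pref_text: str) -> float:
    """Jaccard overlap between query and preference condition."""
    q, p = tokens(query), tokens(pref_text)
    return len(q & p) / len(q | p) if q | p else 0.0


def retrieve(query: str, prefs: list[dict], k: int = 1) -> list[dict]:
    """Return the k preferences whose conditions best match the query."""
    return sorted(prefs, key=lambda pr: score(query, pr["condition"]),
                  reverse=True)[:k]


prefs = [
    {"pref_id": "vb_001", "condition": "When I say 'quick', be concise"},
    {"pref_id": "vb_002", "condition": "For complex ML topics, explain in detail"},
]

# The explicit "quick" cue overlaps Pref A more than Pref B, so only
# the applicable preference reaches the prompt.
top = retrieve("Quick question - explain backpropagation", prefs)
```

A real dense retriever would score the same cue much more decisively; the point is that ranking by query similarity surfaces one preference instead of dumping both into context.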
---

## Session Structure

### Multi-Session Accumulation

```
Session 1: User reveals ~4 format/verbosity preferences
Session 2: User reveals ~4 code style preferences
Session 3: User reveals ~4 math preferences
...
Session 10: All 40 preferences established
```

### Per Session

- 3-5 problems per session
- 10-15 turns per problem (step-by-step solving)
- Total: ~50 turns per session → ~500 turns after 10 sessions

### Context Accumulation

- After 10 sessions: ~50,000+ tokens of history
- Contextual Memory baseline MUST summarize → loses specific preferences
- All Memory Cards baseline: 40 prefs × 50 tokens = 2,000 tokens (manageable but conflicting)
- RAG: retrieves only ~5 relevant prefs per query

---

## Metrics

### Primary Metrics (from their paper)

| Metric | Description | Lower/Higher Better |
|--------|-------------|---------------------|
| Task Accuracy | % of problems solved correctly | Higher |
| User Effort | User token count | Lower |
| Efficiency | Total tokens used | Lower |

### New Metrics (to prove our hypotheses)

| Metric | Description | What It Shows |
|--------|-------------|---------------|
| Conflict Resolution Accuracy | % of conflict scenarios resolved correctly | RAG's advantage |
| Preference Enforcement Rate | How often the user had to correct the agent | Quality of personalization |
| Context Degradation | Accuracy drop as sessions increase | Scalability |
| User Vector Similarity | Cosine sim between z_long and ground truth | User modeling quality |
| Cross-User Discrimination | Different users have different vectors | Vectors capture identity |

---

## Expected Results

### Hypothesis 1: RAG beats context on scalability

```
Sessions 1-3:  All methods ~similar accuracy
Sessions 4-6:  Contextual Memory starts degrading
Sessions 7-10: Contextual Memory significantly worse
RAG methods:   Consistent accuracy throughout
```

### Hypothesis 2: RAG beats all-in-context on conflicts

```
Conflict Resolution Accuracy:
- Vanilla: N/A (no personalization)
- Contextual Memory: ~40% (random when it sees both prefs)
- All Memory Cards: ~50% (sees all, sometimes picks the right one)
- Extractor + RAG: ~70% (retrieves relevant prefs, but sometimes both)
- Extractor + RAG + User Vector: ~85% (vector learns which wins)
```

### Hypothesis 3: User vectors model users well

```
Intra-user similarity (same user, different sessions): >0.9
Inter-user similarity (different users): <0.5
Cluster quality: Users with similar preferences cluster together
```

### Hypothesis 4: Lower user effort with better personalization

```
User Tokens (lower = better):
- Vanilla: 100% (baseline)
- Contextual Memory: 90%
- All Memory Cards: 85%
- Extractor + RAG: 70%
- Extractor + RAG + User Vector: 60%
```

---

## Experiment Execution

### Phase 1: Profile Generation

```bash
python scripts/generate_complex_profiles.py \
    --num_profiles 50 \
    --output_dir collaborativeagents/data/complex_profiles \
    --model gpt-4o-mini
```

### Phase 2: Conflict Scenario Generation

```bash
python scripts/conflict_scenario_generator.py \
    --profiles collaborativeagents/data/complex_profiles/profiles.jsonl \
    --output collaborativeagents/data/conflict_scenarios.jsonl \
    --scenarios_per_conflict 5
```

### Phase 3: Baseline Comparison

```bash
python scripts/run_baseline_comparison.py \
    --baselines vanilla contextual_memory reflection_memory \
        all_memory_cards extractor_rag extractor_rag_user_vector \
    --dataset math-500 \
    --num_sessions 10 \
    --num_users 50 \
    --output_dir collaborativeagents/results
```

### Phase 4: Analysis

- Degradation curves (accuracy vs session number)
- Conflict resolution accuracy by type
- User vector quality analysis
- Token efficiency comparison

---

## Files

| File | Purpose |
|------|---------|
| `data/preference_schema_v2_sample.json` | Sample preference schema |
| `scripts/generate_complex_profiles.py` | LLM-based profile generation |
| `scripts/conflict_scenario_generator.py` | Conflict test generation |
| `scripts/run_baseline_comparison.py` | Experiment runner |
| `results/` | Output directory |

---

## Next Steps

1. [ ] Generate 50 complex user profiles
2. [ ] Generate conflict scenarios for each profile
3. [ ] Integrate your PersonalizedLLM as the RAG + User Vector baseline
4. [ ] Implement an adapter for the CollaborativeAgents conversation flow
5. [ ] Run experiments on math-500 first (validate setup)
6. [ ] Add challenging datasets (GPQA, TheoremQA, etc.)
7. [ ] Run full comparison experiments
8. [ ] Generate analysis plots and tables
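The user-vector quality checks from Hypothesis 3 (intra- vs inter-user similarity) reduce to cosine similarity over per-session vectors. A minimal sketch, with hypothetical toy vectors standing in for z_long; only the >0.9 / <0.5 thresholds come from this document.

```python
# Sketch of the Hypothesis 3 checks: same-user vectors across sessions
# should be near-parallel, different-user vectors should not be.
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical per-session z_long vectors for two users.
user_a = [[0.90, 0.10, 0.00], [0.85, 0.15, 0.05]]
user_b = [[0.10, 0.90, 0.10], [0.05, 0.95, 0.00]]

intra_a = cosine(user_a[0], user_a[1])  # same user, sessions 1 vs 2
inter = cosine(user_a[0], user_b[0])    # different users, session 1

# Targets from Hypothesis 3: intra-user >0.9, inter-user <0.5.
assert intra_a > 0.9 and inter < 0.5
```

Cross-User Discrimination from the metrics table is the same computation aggregated over all user pairs; clustering quality can then be checked on the same vectors.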