path: root/collaborativeagents/EXPERIMENT_DESIGN.md
author    YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 09:57:37 -0600
committer YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 09:57:37 -0600
commit    dc801c07cf38b0c495686463e6ca6f871a64440e (patch)
tree      599f03114775921dbc472403c701f4a3a8ea188a /collaborativeagents/EXPERIMENT_DESIGN.md
parent    e43b3f8aa36c198b95c1e46bea2eaf3893b13dc3 (diff)
Add collaborativeagents module and update gitignore
- Add collaborativeagents subproject with adapters, agents, and evaluation modules
- Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Diffstat (limited to 'collaborativeagents/EXPERIMENT_DESIGN.md')
-rw-r--r--  collaborativeagents/EXPERIMENT_DESIGN.md | 258
1 file changed, 258 insertions(+), 0 deletions(-)
diff --git a/collaborativeagents/EXPERIMENT_DESIGN.md b/collaborativeagents/EXPERIMENT_DESIGN.md
new file mode 100644
index 0000000..e2504b5
--- /dev/null
+++ b/collaborativeagents/EXPERIMENT_DESIGN.md
@@ -0,0 +1,258 @@
+# Experiment Design: RAG-based Personalization vs Baselines
+
+## Goal
+Demonstrate that **Extractor + RAG + User Vector** outperforms all baselines, especially on:
+1. **Long-context scenarios** where context-based methods overflow/degrade
+2. **Conflict resolution** where RAG naturally retrieves the correct preference
+3. **User modeling quality** where user vectors capture true user identity
+
+---
+
+## Baselines
+
+| # | Method | Memory Type | Retrieval | User Model |
+|---|--------|-------------|-----------|------------|
+| 1 | Vanilla | None | - | - |
+| 2 | Contextual Memory | Full history | Recency | - |
+| 3 | Reflection Memory | Agent notes | - | - |
+| 4 | Reflection + GRPO | Trained notes | - | - |
+| 5 | All Memory Cards | Extracted prefs | None (all in context) | - |
+| 6 | Extractor + RAG | Extracted prefs | Dense + Rerank | - |
+| 7 | **Extractor + RAG + User Vector** | Extracted prefs | Policy-based | z_long, z_short |
+
+---
+
+## User Profile Design
+
+### Complexity: 40 Conditional Preferences
+
+Each preference has:
+```json
+{
+ "pref_id": "cs_001",
+ "condition": "writing Python code",
+ "action": "use snake_case, include type hints",
+ "conflict_group": "naming_convention",
+ "priority_context": ["python", ".py", "def"]
+}
+```
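
A record in this schema can be sanity-checked with a small validator. This is a minimal sketch: the required-field set mirrors the sample above, but `validate_preference` itself is an assumed helper, not part of the existing scripts.

```python
import json

# Fields every preference record must carry (taken from the sample schema).
REQUIRED_FIELDS = {"pref_id", "condition", "action", "conflict_group", "priority_context"}

def validate_preference(pref: dict) -> list[str]:
    """Return a list of problems found in a single preference record."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - pref.keys())]
    if not isinstance(pref.get("priority_context", []), list):
        problems.append("priority_context must be a list of cue strings")
    return problems

sample = json.loads("""{
  "pref_id": "cs_001",
  "condition": "writing Python code",
  "action": "use snake_case, include type hints",
  "conflict_group": "naming_convention",
  "priority_context": ["python", ".py", "def"]
}""")
```

Running the validator on a profile at generation time catches malformed records before they reach the retriever.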
+
+### 8 Categories
+- Response Format (4 prefs)
+- Verbosity (5 prefs)
+- Code Style (8 prefs)
+- Math Style (6 prefs)
+- Interaction Pattern (6 prefs)
+- Domain-Specific (6 prefs)
+- Error Correction (4 prefs)
+- Output Artifacts (4 prefs)
+
+### 15 Conflict Groups
+Pairs of preferences that could BOTH apply to a query, but only ONE is correct given context.
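
Since each record carries a `conflict_group` field, the test pairs can be enumerated by grouping on it; a sketch (the helper name and the filtering of singleton groups are assumptions):

```python
from collections import defaultdict
from itertools import combinations

def conflict_pairs(prefs):
    """Group preferences by conflict_group and return, per group,
    every pair of pref_ids that could both fire on the same query."""
    groups = defaultdict(list)
    for p in prefs:
        groups[p["conflict_group"]].append(p["pref_id"])
    # Only groups with 2+ members can actually conflict.
    return {g: list(combinations(ids, 2)) for g, ids in groups.items() if len(ids) > 1}

prefs = [
    {"pref_id": "v_001", "conflict_group": "verbosity"},
    {"pref_id": "v_002", "conflict_group": "verbosity"},
    {"pref_id": "cs_001", "conflict_group": "naming_convention"},
]
```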
+
+---
+
+## Conflict Scenario Design
+
+### Key Insight
+RAG naturally resolves conflicts because it retrieves based on **query similarity**:
+- Query: "Quick question - explain backpropagation"
+- Pref A: "When I say 'quick', be concise" (HIGH similarity to "quick question")
+- Pref B: "For complex ML topics, explain in detail" (lower similarity)
+- RAG retrieves Pref A → correct!
+- Context methods see BOTH → confused, might give long response
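
The ranking intuition above can be illustrated with a toy scorer. Plain token overlap stands in for the dense retriever + reranker here; the function and preference texts are illustrative only:

```python
def overlap_score(query: str, pref_text: str) -> float:
    """Toy stand-in for a dense retriever: fraction of preference
    tokens that also appear in the query (case-insensitive)."""
    q = set(query.lower().split())
    p = set(pref_text.lower().split())
    return len(q & p) / len(p)

query = "Quick question - explain backpropagation"
prefs = {
    "pref_a": "when i say quick be concise",
    "pref_b": "for complex ml topics explain in detail",
}
# Sort preferences by similarity to the query, best first.
ranked = sorted(prefs, key=lambda k: overlap_score(query, prefs[k]), reverse=True)
```

Even this crude scorer ranks Pref A first, because the explicit "quick" cue overlaps the query; a real dense retriever sharpens the same effect.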
+
+### Conflict Categories
+
+| Conflict Type | Example | Why RAG Wins |
+|---------------|---------|--------------|
+| Format (bullets vs numbered) | "Steps to deploy + list mistakes" | Retrieves format per section |
+| Verbosity (concise vs detailed) | "Quick question - how does X work?" | Explicit brevity cue retrieved |
+| Naming (snake vs camel) | "Write in Python and JavaScript" | Language detection per block |
+| Answer position | "What is X and why?" | "What is" retrieves answer-first |
+| Autonomy | "Refactor this" vs "Change line 5" | Scope detection |
+| Example position | "How to use API" vs "What is recursion" | Practical vs theoretical |
+
+### Why Context Methods Fail
+
+1. **All prefs visible**: Sees both "be concise" AND "explain complex topics in detail"
+2. **No ranking**: Can't determine which applies more to this query
+3. **Tries to satisfy both**: Gives medium-length response satisfying neither
+4. **Inconsistent**: Sometimes picks wrong one based on position in context
+
+---
+
+## Challenging Datasets
+
+### Existing (keep)
+- `math-500` (HuggingFaceH4/MATH-500)
+- `humaneval` (code generation)
+- `bigcodebench` (code completion)
+
+### New (add for difficulty)
+| Dataset | Domain | Why Challenging |
+|---------|--------|-----------------|
+| GPQA | PhD-level science | Complex multi-step reasoning |
+| TheoremQA | Math proofs | Requires structured proof format |
+| LiveCodeBench | Competitive programming | Recent, hard problems |
+| AIME | Math competition | Olympiad-level difficulty |
+| SciCode | Scientific computing | Domain expertise needed |
+
+### Step-by-Step Encouragement
+Add prompts that encourage longer, multi-turn problem solving:
+```
+"Work through this carefully:
+1. Identify the problem type
+2. State key concepts needed
+3. Solve step by step
+4. Verify your answer"
+```
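
The scaffold above can be prepended to each problem programmatically; `scaffolded_prompt` below is an assumed helper, not part of the existing runner:

```python
SCAFFOLD = (
    "Work through this carefully:\n"
    "1. Identify the problem type\n"
    "2. State key concepts needed\n"
    "3. Solve step by step\n"
    "4. Verify your answer"
)

def scaffolded_prompt(problem: str) -> str:
    """Prepend the step-by-step scaffold to a raw problem statement."""
    return f"{SCAFFOLD}\n\nProblem: {problem}"
```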
+
+This creates more opportunities for preferences to be revealed and conflicts to occur.
+
+---
+
+## Session Structure
+
+### Multi-Session Accumulation
+```
+Session 1: User reveals ~4 format/verbosity preferences
+Session 2: User reveals ~4 code style preferences
+Session 3: User reveals ~4 math preferences
+...
+Session 10: All 40 preferences established
+```
+
+### Per Session
+- 3-5 problems per session
+- 10-15 turns per problem (step-by-step solving)
+- Total: ~50 turns per session → ~500 turns after 10 sessions
+
+### Context Accumulation
+- After 10 sessions: at least ~50,000 tokens of history
+- Contextual Memory baseline MUST summarize → loses specific preferences
+- All Memory Cards baseline: 40 prefs × 50 tokens = 2000 tokens (manageable but conflicting)
+- RAG: only retrieves ~5 relevant prefs per query
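
The token arithmetic in these bullets can be made explicit. The constants below are the rough estimates from this section; the function name and the per-session history figure (50k tokens / 10 sessions) are illustrative:

```python
TOKENS_PER_PREF = 50                 # rough per-card estimate from above
HISTORY_TOKENS_PER_SESSION = 5_000   # ~50k tokens after 10 sessions

def context_budget(method: str, num_sessions: int, num_prefs: int = 40, k: int = 5) -> int:
    """Approximate prompt tokens each baseline spends on memory."""
    if method == "contextual_memory":
        return num_sessions * HISTORY_TOKENS_PER_SESSION  # grows linearly with sessions
    if method == "all_memory_cards":
        return num_prefs * TOKENS_PER_PREF                # flat, but every conflict visible
    if method == "extractor_rag":
        return k * TOKENS_PER_PREF                        # only the retrieved prefs
    return 0  # vanilla: no memory
```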
+
+---
+
+## Metrics
+
+### Primary Metrics (from the original CollaborativeAgents paper)
+
+| Metric | Description | Lower/Higher Better |
+|--------|-------------|---------------------|
+| Task Accuracy | % of problems solved correctly | Higher |
+| User Effort | User token count | Lower |
+| Efficiency | Total tokens used | Lower |
+
+### New Metrics (to prove our hypotheses)
+
+| Metric | Description | What It Shows |
+|--------|-------------|---------------|
+| Conflict Resolution Accuracy | % of conflict scenarios resolved correctly | RAG's advantage |
+| Preference Enforcement Rate | How often user had to correct agent | Quality of personalization |
+| Context Degradation | Accuracy drop as sessions increase | Scalability |
+| User Vector Similarity | Cosine sim between z_long and ground truth | User modeling quality |
+| Cross-User Discrimination | Different users have different vectors | Vectors capture identity |
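
The last two metrics reduce to cosine similarities over session-level user vectors. A sketch, assuming each user contributes one `z_long` per session (`intra_inter_similarity` is an illustrative helper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def intra_inter_similarity(vectors_by_user):
    """vectors_by_user: {user_id: [z_long from each session, ...]}.
    Returns (mean same-user similarity, mean cross-user similarity)."""
    intra, inter = [], []
    users = list(vectors_by_user)
    for u in users:  # all within-user session pairs
        vs = vectors_by_user[u]
        intra += [cosine(a, b) for i, a in enumerate(vs) for b in vs[i + 1:]]
    for i, u in enumerate(users):  # all cross-user pairs
        for v in users[i + 1:]:
            inter += [cosine(a, b) for a in vectors_by_user[u] for b in vectors_by_user[v]]
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

A large gap between the two returned means is what Hypothesis 3 predicts.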
+
+---
+
+## Expected Results
+
+### Hypothesis 1: RAG beats context on scalability
+```
+Session 1-3: All methods ~similar accuracy
+Session 4-6: Contextual Memory starts degrading
+Session 7-10: Contextual Memory significantly worse
+RAG methods: Consistent accuracy throughout
+```
+
+### Hypothesis 2: RAG beats all-in-context on conflicts
+```
+Conflict Resolution Accuracy:
+- Vanilla: N/A (no personalization)
+- Contextual Memory: ~40% (near-random when it sees both prefs)
+- All Memory Cards: ~50% (sees all prefs, sometimes picks the right one)
+- Extractor + RAG: ~70% (retrieves the relevant pref, but sometimes both)
+- Extractor + RAG + User Vector: ~85% (vector learns which wins)
+```
+
+### Hypothesis 3: User vectors model users well
+```
+Intra-user similarity (same user, different sessions): >0.9
+Inter-user similarity (different users): <0.5
+Cluster quality: Users with similar preferences cluster together
+```
+
+### Hypothesis 4: Lower user effort with better personalization
+```
+User Tokens (lower = better):
+- Vanilla: 100% (baseline)
+- Contextual Memory: 90%
+- All Memory Cards: 85%
+- Extractor + RAG: 70%
+- Extractor + RAG + User Vector: 60%
+```
+
+---
+
+## Experiment Execution
+
+### Phase 1: Profile Generation
+```bash
+python scripts/generate_complex_profiles.py \
+ --num_profiles 50 \
+ --output_dir collaborativeagents/data/complex_profiles \
+ --model gpt-4o-mini
+```
+
+### Phase 2: Conflict Scenario Generation
+```bash
+python scripts/conflict_scenario_generator.py \
+ --profiles collaborativeagents/data/complex_profiles/profiles.jsonl \
+ --output collaborativeagents/data/conflict_scenarios.jsonl \
+ --scenarios_per_conflict 5
+```
+
+### Phase 3: Baseline Comparison
+```bash
+python scripts/run_baseline_comparison.py \
+ --baselines vanilla contextual_memory reflection_memory \
+ all_memory_cards extractor_rag extractor_rag_user_vector \
+ --dataset math-500 \
+ --num_sessions 10 \
+ --num_users 50 \
+ --output_dir collaborativeagents/results
+```
+
+### Phase 4: Analysis
+- Degradation curves (accuracy vs session number)
+- Conflict resolution accuracy by type
+- User vector quality analysis
+- Token efficiency comparison
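
Given per-episode result records, the degradation curves could be computed as below. The record field names (`method`, `session`, `correct`) are assumptions about the runner's output format:

```python
from collections import defaultdict

def degradation_curve(records):
    """records: iterable of {"method", "session", "correct"} dicts.
    Returns {method: [accuracy at session 1, session 2, ...]}."""
    tally = defaultdict(lambda: defaultdict(list))
    for r in records:
        tally[r["method"]][r["session"]].append(r["correct"])
    return {
        m: [sum(v) / len(v) for _, v in sorted(by_sess.items())]
        for m, by_sess in tally.items()
    }
```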
+
+---
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `data/preference_schema_v2_sample.json` | Sample preference schema |
+| `scripts/generate_complex_profiles.py` | LLM-based profile generation |
+| `scripts/conflict_scenario_generator.py` | Conflict test generation |
+| `scripts/run_baseline_comparison.py` | Experiment runner |
+| `results/` | Output directory |
+
+---
+
+## Next Steps
+
+1. [ ] Generate 50 complex user profiles
+2. [ ] Generate conflict scenarios for each profile
+3. [ ] Integrate your PersonalizedLLM as the RAG + User Vector baseline
+4. [ ] Implement adapter for CollaborativeAgents conversation flow
+5. [ ] Run experiments on math-500 first (validate setup)
+6. [ ] Add challenging datasets (GPQA, TheoremQA, etc.)
+7. [ ] Run full comparison experiments
+8. [ ] Generate analysis plots and tables