# Experiment Design: RAG-based Personalization vs Baselines

## Goal

Demonstrate that **Extractor + RAG + User Vector** outperforms all baselines, especially on:

1. **Long-context scenarios** where context-based methods overflow/degrade
2. **Conflict resolution** where RAG naturally retrieves the correct preference
3. **User modeling quality** where user vectors capture true user identity

---

## Baselines

| # | Method | Memory Type | Retrieval | User Model |
|---|--------|-------------|-----------|------------|
| 1 | Vanilla | None | - | - |
| 2 | Contextual Memory | Full history | Recency | - |
| 3 | Reflection Memory | Agent notes | - | - |
| 4 | Reflection + GRPO | Trained notes | - | - |
| 5 | All Memory Cards | Extracted prefs | None (all in context) | - |
| 6 | Extractor + RAG | Extracted prefs | Dense + Rerank | - |
| 7 | **Extractor + RAG + User Vector** | Extracted prefs | Policy-based | z_long, z_short |

---

## User Profile Design

### Complexity: 40 Conditional Preferences

Each preference has:

```json
{
  "pref_id": "cs_001",
  "condition": "writing Python code",
  "action": "use snake_case, include type hints",
  "conflict_group": "naming_convention",
  "priority_context": ["python", ".py", "def"]
}
```

### 8 Categories

- Response Format (4 prefs)
- Verbosity (5 prefs)
- Code Style (8 prefs)
- Math Style (5 prefs)
- Interaction Pattern (5 prefs)
- Domain-Specific (5 prefs)
- Error Correction (4 prefs)
- Output Artifacts (4 prefs)

### 15 Conflict Groups

Pairs of preferences that could BOTH apply to a query, but only ONE is correct given the context.

---

## Conflict Scenario Design

### Key Insight

RAG naturally resolves conflicts because it retrieves based on **query similarity**:

- Query: "Quick question - explain backpropagation"
- Pref A: "When I say 'quick', be concise" (HIGH similarity to "quick question")
- Pref B: "For complex ML topics, explain in detail" (lower similarity)
- RAG retrieves Pref A → correct!
- Context methods see BOTH → confused, might give a long response

### Conflict Categories

| Conflict Type | Example | Why RAG Wins |
|---------------|---------|--------------|
| Format (bullets vs numbered) | "Steps to deploy + list mistakes" | Retrieves format per section |
| Verbosity (concise vs detailed) | "Quick question - how does X work?" | Explicit brevity cue retrieved |
| Naming (snake vs camel) | "Write in Python and JavaScript" | Language detection per block |
| Answer position | "What is X and why?" | "What is" retrieves answer-first |
| Autonomy | "Refactor this" vs "Change line 5" | Scope detection |
| Example position | "How to use API" vs "What is recursion" | Practical vs theoretical |

### Why Context Methods Fail

1. **All prefs visible**: Sees both "be concise" AND "explain complex topics in detail"
2. **No ranking**: Can't determine which applies more to this query
3. **Tries to satisfy both**: Gives a medium-length response satisfying neither
4. **Inconsistent**: Sometimes picks the wrong one based on its position in context

---

## Challenging Datasets

### Existing (keep)

- `math-500` (HuggingFaceH4/MATH-500)
- `humaneval` (code generation)
- `bigcodebench` (code completion)

### New (add for difficulty)

| Dataset | Domain | Why Challenging |
|---------|--------|-----------------|
| GPQA | PhD-level science | Complex multi-step reasoning |
| TheoremQA | Math proofs | Requires structured proof format |
| LiveCodeBench | Competitive programming | Recent, hard problems |
| AIME | Math competition | Olympiad-level difficulty |
| SciCode | Scientific computing | Domain expertise needed |

### Step-by-Step Encouragement

Add prompts that encourage longer, multi-turn problem solving:

```
"Work through this carefully:
1. Identify the problem type
2. State key concepts needed
3. Solve step by step
4. Verify your answer"
```

This creates more opportunities for preferences to be revealed and conflicts to occur.
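The retrieval step behind the Key Insight above can be sketched with a toy similarity function. This is a minimal, self-contained stand-in for the dense embedder + reranker pipeline; `tokens`, `score`, `retrieve`, and the sample prefs are illustrative names, not the project's actual API.

```python
# Toy sketch of conflict resolution via retrieval: token-overlap
# similarity stands in for dense embeddings + reranking.
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query: str, pref_text: str) -> float:
    """Jaccard overlap between query and preference condition."""
    q, p = tokens(query), tokens(pref_text)
    return len(q & p) / len(q | p) if q | p else 0.0


def retrieve(query: str, prefs: list[dict], k: int = 1) -> list[dict]:
    """Return the k preferences whose conditions best match the query."""
    return sorted(prefs, key=lambda pr: score(query, pr["condition"]),
                  reverse=True)[:k]


prefs = [
    {"pref_id": "vb_001", "condition": "When I say 'quick', be concise"},
    {"pref_id": "vb_002", "condition": "For complex ML topics, explain in detail"},
]

# The explicit "quick" cue overlaps Pref A more than Pref B, so only
# the applicable preference reaches the prompt.
top = retrieve("Quick question - explain backpropagation", prefs)
```

A real dense retriever would score the same cue much more decisively; the point is that ranking by query similarity surfaces one preference instead of dumping both into context.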
---

## Session Structure

### Multi-Session Accumulation

```
Session 1: User reveals ~4 format/verbosity preferences
Session 2: User reveals ~4 code style preferences
Session 3: User reveals ~4 math preferences
...
Session 10: All 40 preferences established
```

### Per Session

- 3-5 problems per session
- 10-15 turns per problem (step-by-step solving)
- Total: ~50 turns per session → ~500 turns after 10 sessions

### Context Accumulation

- After 10 sessions: ~50,000+ tokens of history
- Contextual Memory baseline MUST summarize → loses specific preferences
- All Memory Cards baseline: 40 prefs × 50 tokens = 2,000 tokens (manageable but conflicting)
- RAG: retrieves only ~5 relevant prefs per query

---

## Metrics

### Primary Metrics (from their paper)

| Metric | Description | Lower/Higher Better |
|--------|-------------|---------------------|
| Task Accuracy | % of problems solved correctly | Higher |
| User Effort | User token count | Lower |
| Efficiency | Total tokens used | Lower |

### New Metrics (to prove our hypotheses)

| Metric | Description | What It Shows |
|--------|-------------|---------------|
| Conflict Resolution Accuracy | % of conflict scenarios resolved correctly | RAG's advantage |
| Preference Enforcement Rate | How often the user had to correct the agent | Quality of personalization |
| Context Degradation | Accuracy drop as sessions increase | Scalability |
| User Vector Similarity | Cosine sim between z_long and ground truth | User modeling quality |
| Cross-User Discrimination | Different users have different vectors | Vectors capture identity |

---

## Expected Results

### Hypothesis 1: RAG beats context on scalability

```
Sessions 1-3:  All methods ~similar accuracy
Sessions 4-6:  Contextual Memory starts degrading
Sessions 7-10: Contextual Memory significantly worse
RAG methods:   Consistent accuracy throughout
```

### Hypothesis 2: RAG beats all-in-context on conflicts

```
Conflict Resolution Accuracy:
- Vanilla: N/A (no personalization)
- Contextual Memory: ~40% (random when it sees both prefs)
- All Memory Cards: ~50% (sees all, sometimes picks the right one)
- Extractor + RAG: ~70% (retrieves relevant prefs, but sometimes both)
- Extractor + RAG + User Vector: ~85% (vector learns which wins)
```

### Hypothesis 3: User vectors model users well

```
Intra-user similarity (same user, different sessions): >0.9
Inter-user similarity (different users): <0.5
Cluster quality: Users with similar preferences cluster together
```

### Hypothesis 4: Lower user effort with better personalization

```
User Tokens (lower = better):
- Vanilla: 100% (baseline)
- Contextual Memory: 90%
- All Memory Cards: 85%
- Extractor + RAG: 70%
- Extractor + RAG + User Vector: 60%
```

---

## Experiment Execution

### Phase 1: Profile Generation

```bash
python scripts/generate_complex_profiles.py \
    --num_profiles 50 \
    --output_dir collaborativeagents/data/complex_profiles \
    --model gpt-4o-mini
```

### Phase 2: Conflict Scenario Generation

```bash
python scripts/conflict_scenario_generator.py \
    --profiles collaborativeagents/data/complex_profiles/profiles.jsonl \
    --output collaborativeagents/data/conflict_scenarios.jsonl \
    --scenarios_per_conflict 5
```

### Phase 3: Baseline Comparison

```bash
python scripts/run_baseline_comparison.py \
    --baselines vanilla contextual_memory reflection_memory \
        all_memory_cards extractor_rag extractor_rag_user_vector \
    --dataset math-500 \
    --num_sessions 10 \
    --num_users 50 \
    --output_dir collaborativeagents/results
```

### Phase 4: Analysis

- Degradation curves (accuracy vs session number)
- Conflict resolution accuracy by type
- User vector quality analysis
- Token efficiency comparison

---

## Files

| File | Purpose |
|------|---------|
| `data/preference_schema_v2_sample.json` | Sample preference schema |
| `scripts/generate_complex_profiles.py` | LLM-based profile generation |
| `scripts/conflict_scenario_generator.py` | Conflict test generation |
| `scripts/run_baseline_comparison.py` | Experiment runner |
| `results/` | Output directory |

---

## Next Steps

1. [ ] Generate 50 complex user profiles
2. [ ] Generate conflict scenarios for each profile
3. [ ] Integrate your PersonalizedLLM as the RAG + User Vector baseline
4. [ ] Implement an adapter for the CollaborativeAgents conversation flow
5. [ ] Run experiments on math-500 first (validate setup)
6. [ ] Add challenging datasets (GPQA, TheoremQA, etc.)
7. [ ] Run full comparison experiments
8. [ ] Generate analysis plots and tables
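The user-vector quality checks from Hypothesis 3 (intra- vs inter-user similarity) reduce to cosine similarity over per-session vectors. A minimal sketch, with hypothetical toy vectors standing in for z_long; only the >0.9 / <0.5 thresholds come from this document.

```python
# Sketch of the Hypothesis 3 checks: same-user vectors across sessions
# should be near-parallel, different-user vectors should not be.
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical per-session z_long vectors for two users.
user_a = [[0.90, 0.10, 0.00], [0.85, 0.15, 0.05]]
user_b = [[0.10, 0.90, 0.10], [0.05, 0.95, 0.00]]

intra_a = cosine(user_a[0], user_a[1])  # same user, sessions 1 vs 2
inter = cosine(user_a[0], user_b[0])    # different users, session 1

# Targets from Hypothesis 3: intra-user >0.9, inter-user <0.5.
assert intra_a > 0.9 and inter < 0.5
```

Cross-User Discrimination from the metrics table is the same computation aggregated over all user pairs; clustering quality can then be checked on the same vectors.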