author    YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 09:57:37 -0600
committer YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 09:57:37 -0600
commit    dc801c07cf38b0c495686463e6ca6f871a64440e (patch)
tree      599f03114775921dbc472403c701f4a3a8ea188a /collaborativeagents/EXPERIMENT_DESIGN.md
parent    e43b3f8aa36c198b95c1e46bea2eaf3893b13dc3 (diff)
Add collaborativeagents module and update gitignore
- Add collaborativeagents subproject with adapters, agents, and evaluation modules
- Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Diffstat (limited to 'collaborativeagents/EXPERIMENT_DESIGN.md')
-rw-r--r-- collaborativeagents/EXPERIMENT_DESIGN.md | 258
1 file changed, 258 insertions, 0 deletions
diff --git a/collaborativeagents/EXPERIMENT_DESIGN.md b/collaborativeagents/EXPERIMENT_DESIGN.md
new file mode 100644
index 0000000..e2504b5
--- /dev/null
+++ b/collaborativeagents/EXPERIMENT_DESIGN.md
@@ -0,0 +1,258 @@

# Experiment Design: RAG-based Personalization vs Baselines

## Goal
Demonstrate that **Extractor + RAG + User Vector** outperforms all baselines, especially on:
1. **Long-context scenarios** where context-based methods overflow/degrade
2. **Conflict resolution** where RAG naturally retrieves the correct preference
3. **User modeling quality** where user vectors capture true user identity

---

## Baselines

| # | Method | Memory Type | Retrieval | User Model |
|---|--------|-------------|-----------|------------|
| 1 | Vanilla | None | - | - |
| 2 | Contextual Memory | Full history | Recency | - |
| 3 | Reflection Memory | Agent notes | - | - |
| 4 | Reflection + GRPO | Trained notes | - | - |
| 5 | All Memory Cards | Extracted prefs | None (all in context) | - |
| 6 | Extractor + RAG | Extracted prefs | Dense + Rerank | - |
| 7 | **Extractor + RAG + User Vector** | Extracted prefs | Policy-based | z_long, z_short |

---

## User Profile Design

### Complexity: 40 Conditional Preferences

Each preference has:
```json
{
  "pref_id": "cs_001",
  "condition": "writing Python code",
  "action": "use snake_case, include type hints",
  "conflict_group": "naming_convention",
  "priority_context": ["python", ".py", "def"]
}
```

### 8 Categories
- Response Format (4 prefs)
- Verbosity (5 prefs)
- Code Style (8 prefs)
- Math Style (6 prefs)
- Interaction Pattern (6 prefs)
- Domain-Specific (6 prefs)
- Error Correction (4 prefs)
- Output Artifacts (4 prefs)

### 15 Conflict Groups
Pairs of preferences that could BOTH apply to a query, but only ONE is correct given context.
---

## Conflict Scenario Design

### Key Insight
RAG naturally resolves conflicts because it retrieves based on **query similarity**:
- Query: "Quick question - explain backpropagation"
- Pref A: "When I say 'quick', be concise" (HIGH similarity to "quick question")
- Pref B: "For complex ML topics, explain in detail" (lower similarity)
- RAG retrieves Pref A → correct!
- Context methods see BOTH → confused, might give long response

### Conflict Categories

| Conflict Type | Example | Why RAG Wins |
|---------------|---------|--------------|
| Format (bullets vs numbered) | "Steps to deploy + list mistakes" | Retrieves format per section |
| Verbosity (concise vs detailed) | "Quick question - how does X work?" | Explicit brevity cue retrieved |
| Naming (snake vs camel) | "Write in Python and JavaScript" | Language detection per block |
| Answer position | "What is X and why?" | "What is" retrieves answer-first |
| Autonomy | "Refactor this" vs "Change line 5" | Scope detection |
| Example position | "How to use API" vs "What is recursion" | Practical vs theoretical |

### Why Context Methods Fail

1. **All prefs visible**: Sees both "be concise" AND "explain complex topics in detail"
2. **No ranking**: Can't determine which applies more to this query
3. **Tries to satisfy both**: Gives medium-length response satisfying neither
4. **Inconsistent**: Sometimes picks wrong one based on position in context

---

## Challenging Datasets

### Existing (keep)
- `math-500` (HuggingFaceH4/MATH-500)
- `humaneval` (code generation)
- `bigcodebench` (code completion)

### New (add for difficulty)

| Dataset | Domain | Why Challenging |
|---------|--------|-----------------|
| GPQA | PhD-level science | Complex multi-step reasoning |
| TheoremQA | Math proofs | Requires structured proof format |
| LiveCodeBench | Competitive programming | Recent, hard problems |
| AIME | Math competition | Olympiad-level difficulty |
| SciCode | Scientific computing | Domain expertise needed |

### Step-by-Step Encouragement
Add prompts that encourage longer, multi-turn problem solving:
```
"Work through this carefully:
1. Identify the problem type
2. State key concepts needed
3. Solve step by step
4. Verify your answer"
```

This creates more opportunities for preferences to be revealed and conflicts to occur.

---

## Session Structure

### Multi-Session Accumulation
```
Session 1: User reveals ~4 format/verbosity preferences
Session 2: User reveals ~4 code style preferences
Session 3: User reveals ~4 math preferences
...
Session 10: All 40 preferences established
```

### Per Session
- 3-5 problems per session
- 10-15 turns per problem (step-by-step solving)
- Total: ~50 turns per session → ~500 turns after 10 sessions

### Context Accumulation
- After 10 sessions: ~50,000+ tokens of history
- Contextual Memory baseline MUST summarize → loses specific preferences
- All Memory Cards baseline: 40 prefs × 50 tokens = 2000 tokens (manageable but conflicting)
- RAG: only retrieves ~5 relevant prefs per query

---

## Metrics

### Primary Metrics (from their paper)

| Metric | Description | Lower/Higher Better |
|--------|-------------|---------------------|
| Task Accuracy | % of problems solved correctly | Higher |
| User Effort | User token count | Lower |
| Efficiency | Total tokens used | Lower |

### New Metrics (to prove our hypotheses)

| Metric | Description | What It Shows |
|--------|-------------|---------------|
| Conflict Resolution Accuracy | % of conflict scenarios resolved correctly | RAG's advantage |
| Preference Enforcement Rate | How often user had to correct agent | Quality of personalization |
| Context Degradation | Accuracy drop as sessions increase | Scalability |
| User Vector Similarity | Cosine sim between z_long and ground truth | User modeling quality |
| Cross-User Discrimination | Different users have different vectors | Vectors capture identity |

---

## Expected Results

### Hypothesis 1: RAG beats context on scalability
```
Session 1-3: All methods ~similar accuracy
Session 4-6: Contextual Memory starts degrading
Session 7-10: Contextual Memory significantly worse
RAG methods: Consistent accuracy throughout
```

### Hypothesis 2: RAG beats all-in-context on conflicts
```
Conflict Resolution Accuracy:
- Vanilla: N/A (no personalization)
- Contextual Memory: ~40% (random when sees both prefs)
- All Memory Cards: ~50% (sees all, sometimes picks right one)
- Extractor + RAG: ~70% (retrieves relevant, but sometimes both)
- Extractor + RAG + User Vector: ~85% (vector learns which wins)
```

### Hypothesis 3: User vectors model users well
```
Intra-user similarity (same user, different sessions): >0.9
Inter-user similarity (different users): <0.5
Cluster quality: Users with similar preferences cluster together
```

### Hypothesis 4: Lower user effort with better personalization
```
User Tokens (lower = better):
- Vanilla: 100% (baseline)
- Contextual Memory: 90%
- All Memory Cards: 85%
- Extractor + RAG: 70%
- Extractor + RAG + User Vector: 60%
```

---

## Experiment Execution

### Phase 1: Profile Generation
```bash
python scripts/generate_complex_profiles.py \
    --num_profiles 50 \
    --output_dir collaborativeagents/data/complex_profiles \
    --model gpt-4o-mini
```

### Phase 2: Conflict Scenario Generation
```bash
python scripts/conflict_scenario_generator.py \
    --profiles collaborativeagents/data/complex_profiles/profiles.jsonl \
    --output collaborativeagents/data/conflict_scenarios.jsonl \
    --scenarios_per_conflict 5
```

### Phase 3: Baseline Comparison
```bash
python scripts/run_baseline_comparison.py \
    --baselines vanilla contextual_memory reflection_memory \
        all_memory_cards extractor_rag extractor_rag_user_vector \
    --dataset math-500 \
    --num_sessions 10 \
    --num_users 50 \
    --output_dir collaborativeagents/results
```

### Phase 4: Analysis
- Degradation curves (accuracy vs session number)
- Conflict resolution accuracy by type
- User vector quality analysis
- Token efficiency comparison

---

## Files

| File | Purpose |
|------|---------|
| `data/preference_schema_v2_sample.json` | Sample preference schema |
| `scripts/generate_complex_profiles.py` | LLM-based profile generation |
| `scripts/conflict_scenario_generator.py` | Conflict test generation |
| `scripts/run_baseline_comparison.py` | Experiment runner |
| `results/` | Output directory |

---

## Next Steps

1. [ ] Generate 50 complex user profiles
2. [ ] Generate conflict scenarios for each profile
3. [ ] Integrate your PersonalizedLLM as the RAG + User Vector baseline
4. [ ] Implement adapter for CollaborativeAgents conversation flow
5. [ ] Run experiments on math-500 first (validate setup)
6. [ ] Add challenging datasets (GPQA, TheoremQA, etc.)
7. [ ] Run full comparison experiments
8. [ ] Generate analysis plots and tables
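The user-vector quality checks in Phase 4 (and the thresholds in Hypothesis 3) come down to cosine similarity between learned vectors. A minimal sketch with made-up 4-dimensional stand-ins for z_long; real vectors would be embedding-sized and come from the trained model:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two user vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: two sessions of user A, one session of user B.
user_a_session1 = [0.9, 0.1, 0.8, 0.3]
user_a_session2 = [0.85, 0.15, 0.75, 0.35]
user_b_session1 = [0.1, 0.9, 0.2, 0.7]

intra = cosine(user_a_session1, user_a_session2)  # same user, different sessions
inter = cosine(user_a_session1, user_b_session1)  # different users
print(f"intra-user: {intra:.2f}, inter-user: {inter:.2f}")
```

In the toy numbers above, intra-user similarity lands above 0.9 and inter-user similarity below 0.5, matching the pattern Hypothesis 3 predicts for well-separated users.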
