# Internal Project Notes: Personalization Benchmark

> **Purpose**: Internal documentation for AI assistants and developers to understand project state, design decisions, and next steps.

---

## 1. Project Overview

### What This Project Is
A benchmark to evaluate **retrieval-based personalization** (user's main project) against various baselines using a **user simulator framework** (adapted from CollaborativeAgents/MULTISESSIONCOLLAB).

### Core Hypothesis
**Extractor + RAG + User Vector** outperforms all baselines because:
1. **Scalability**: RAG retrieves only relevant preferences (doesn't overflow context)
2. **Conflict Resolution**: User vector learns which preferences apply in which situations
3. **Avoids Over-personalization**: Doesn't apply irrelevant preferences

### User's Main Project Location
`/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/src/personalization/`

Key components:
- `serving/personalized_llm.py` - Main interface with `chat()`, `apply_feedback()`, `reset_session()`
- `retrieval/pipeline.py` - Dense retrieval + reranking + policy-based selection
- `user_model/tensor_store.py` - Dual vectors: z_long (persistent), z_short (session)
- `feedback/policy/reinforce.py` - REINFORCE updates from inferred rewards

---

## 2. 
Experiment Design + +### Baselines (7 methods) + +| Method | Description | Expected Weakness | +|--------|-------------|-------------------| +| `vanilla` | No memory | Forgets everything | +| `contextual` | Full history in context, summarize at limit | Summary loses specific preferences | +| `reflection` | CollaborativeAgents' agent_notes | Notes become unwieldy | +| `reflection_grpo` | Reflection + GRPO training | Better but no selective retrieval | +| `all_memory` | All extracted memories in context | Over-personalization, conflict confusion | +| `rag` | Extractor + RAG (no user vector) | Good retrieval but random on conflicts | +| `rag_vector` | Extractor + RAG + user vector | **Proposed best method** | + +### Datasets + +**Existing (from CollaborativeAgents):** +- math-500, math-hard, humaneval, bigcodebench, logiqa, mmlu, medqa + +**New challenging datasets (added):** +- `gpqa` - PhD-level science (extremely hard) +- `theoremqa` - Theorem-based math proofs +- `aime` - Competition math (answers 0-999) +- `livecodebench` - Recent competitive programming +- `scicode` - Scientific computing + +### Metrics + +| Metric | Description | Source | +|--------|-------------|--------| +| Task Success | Did agent solve the problem? | LLM judge | +| User Effort | User's token count + enforcement count | Automatic | +| Efficiency | Total tokens used | Automatic | +| Conflict Resolution Accuracy | Picked correct preference in conflicts? | LLM judge | +| Over-personalization Rate | Applied irrelevant preferences? | LLM judge | +| Preference Compliance | Followed applicable preferences? 
| LLM judge | + +### User Profiles + +**Key change from original:** +- Original: 3 flat preferences per user (too simple) +- Ours: **~40 conditional preferences** with explicit conditions + +Example: +```json +{ + "condition": "debugging code", + "preference": "show full stack traces and line numbers", + "conflict_group": "code_detail" +} +``` + +**Conflict groups** (15 types): +- format_structure (bullets vs numbered) +- verbosity_level (concise vs detailed) +- code_naming (snake_case vs camelCase by language) +- math_detail (derivations vs final answer) +- guidance_style (incremental vs holistic) +- etc. + +### Conflict Testing Design + +**Critical for proving hypothesis.** Every conflict query: +1. Triggers 2+ conflicting preferences +2. Has ONE correct preference based on context +3. RAG naturally retrieves the correct one +4. Context methods see ALL and get confused + +Example: +``` +Query: "Quick question - how does backpropagation work?" + +Triggered preferences: + ✓ "When I say 'quick', be concise" (matches query signal) + ✗ "For complex ML topics, explain in detail" (lower relevance) + +RAG retrieves: concise preference (CORRECT) +Context methods see: BOTH → confused → often wrong +``` + +--- + +## 3. Technical Architecture + +### Models +- **User Simulator**: Llama-3.3-70B-Instruct (powerful, simulates realistic user) +- **Agent**: Llama-3.1-8B-Instruct (smaller, realistic deployment) +- **LLM Judge**: Llama-3.3-70B-Instruct (evaluates quality) + +### Compute Resources (from sinfo) +- `gpu` partition: 4x A100 80GB +- `h100` partition: 4x H100 80GB +- Good for running 70B models + +### Session Structure +- 100 user profiles +- 20 sessions per profile +- 15 max turns per session +- 30% of queries are conflict tests + +--- + +## 4. 
Files Created + +### Core Experiment Framework + +| File | Status | Purpose | +|------|--------|---------| +| `scripts/run_experiments.py` | DONE | Main orchestrator - runs all methods, evaluates, generates report | +| `scripts/run_baseline_comparison.py` | DONE | Baseline comparison runner | +| `scripts/conflict_scenario_generator.py` | DONE | Generates 70+ conflict templates | +| `evaluation/llm_judge.py` | DONE | LLM-as-judge for all metrics | +| `evaluation/__init__.py` | DONE | Module exports | + +### Profile Generation + +| File | Status | Purpose | +|------|--------|---------| +| `data/preference_schema_v2_sample.json` | DONE | Sample schema with 40 preferences, 15 conflict groups | +| `scripts/generate_complex_profiles.py` | DONE | LLM-based batch generation | +| `scripts/generate_profiles_v2.py` | DONE | Profile generation with quality control | + +### Dataset & Prompts + +| File | Status | Purpose | +|------|--------|---------| +| `datasets_extended.py` | DONE | New challenging datasets (GPQA, AIME, etc.) | +| `prompts_extended.py` | DONE | Step-by-step prompts to make sessions longer | + +### Integration + +| File | Status | Purpose | +|------|--------|---------| +| `adapters/personalized_llm_adapter.py` | DONE | Wraps PersonalizedLLM for benchmark | + +### Documentation + +| File | Status | Purpose | +|------|--------|---------| +| `EXPERIMENT_DESIGN.md` | DONE | Human-readable design doc | +| `INTERNAL_NOTES.md` | DONE | This file - AI/developer notes | + +--- + +## 5. What's TBD (To Be Done) + +### High Priority + +1. **Generate 100 User Profiles** + - Run `scripts/generate_profiles_v2.py` with LLM + - Need SLURM script for cluster + - Output: `data/complex_profiles_100.json` + +2. **End-to-End Integration Test** + - Verify PersonalizedLLM adapter works with conversation generator + - Test with 2-3 profiles, 1-2 sessions each + - Fix any import/path issues + +3. 
**SLURM Scripts for Cluster** + - Profile generation job + - Main experiment job (may need multi-GPU) + +### Medium Priority + +4. **Baseline Adapter Implementations** + - `vanilla` - simple, no memory + - `contextual` - needs summarization logic + - `reflection` - use existing CollaborativeAgents code + - `reflection_grpo` - need trained model + - `all_memory` - dump all memories in context + +5. **User Simulator Integration** + - Ensure user agent enforces preferences + - Ensure user agent expresses disappointment + - Both should count as negative signals + +### Lower Priority + +6. **User Vector Similarity Analysis** + - Create ground truth vectors from preference categories + - Compare learned z_long to ground truth + - Proves user vectors capture identity + +7. **Visualization & Reporting** + - Token usage graphs over sessions + - Conflict resolution accuracy by type + - User effort reduction curves + +--- + +## 6. Key Design Decisions (Reference) + +### Why 40 Preferences? +- Original 3 is too simple - any method can handle it +- 40 creates realistic complexity +- Forces context-based methods to overflow or summarize (losing info) +- Creates many conflict opportunities + +### Why Conditional Preferences? +- Real users have context-dependent preferences +- "Be concise WHEN I'm rushed" vs "Explain thoroughly WHEN I'm learning" +- RAG naturally handles this via query similarity +- Context methods struggle with conflicting conditions + +### Why Step-by-Step Prompts? +- Makes sessions longer (more turns) +- More opportunities for preference expression/violation +- Generates more context (stresses context-based methods) +- More realistic for complex problems + +### Why LLM-as-Judge? +- Preference compliance is subjective +- Can't programmatically check "was this concise enough?" +- LLM judge provides nuanced evaluation +- Using 70B model for quality + +--- + +## 7. 
Code Locations Reference + +### User's Personalization System +``` +src/personalization/ +├── serving/personalized_llm.py # Main API +├── retrieval/pipeline.py # RAG pipeline +├── user_model/tensor_store.py # User vectors +├── feedback/policy/reinforce.py # Online learning +├── models/preference_extractor/ # Extraction +└── configs/ # Config files +``` + +### CollaborativeAgents (Original) +``` +collaborativeagents/collaborativeagents/ +├── agents/user_agent.py # User simulator +├── agents/collaborator_agent.py # Agent interface +├── conversation_generator.py # Conversation loop +├── conversation_evaluator.py # Evaluation +├── prompts.py # Original prompts +└── datasets/ # Original datasets +``` + +### Our Extensions +``` +collaborativeagents/ +├── scripts/ +│ ├── run_experiments.py # Main entry point +│ ├── generate_profiles_v2.py # Profile generation +│ └── conflict_scenario_generator.py +├── evaluation/ +│ └── llm_judge.py # LLM evaluation +├── adapters/ +│ └── personalized_llm_adapter.py # Integration +├── datasets_extended.py # New datasets +├── prompts_extended.py # Step-by-step prompts +└── data/ + └── preference_schema_v2_sample.json +``` + +--- + +## 8. Quick Start Commands (TBD - update after testing) + +```bash +# Generate profiles (need SLURM script) +python scripts/generate_profiles_v2.py --n-profiles 100 --output data/profiles.json + +# Run experiments +python scripts/run_experiments.py \ + --methods vanilla,contextual,rag,rag_vector \ + --datasets gpqa,aime \ + --n-profiles 100 \ + --n-sessions 20 \ + --output-dir results/ + +# Quick test (2 profiles, 2 sessions) +python scripts/run_experiments.py \ + --methods rag_vector \ + --datasets math-500 \ + --n-profiles 2 \ + --n-sessions 2 \ + --output-dir results/test/ +``` + +--- + +## 9. Dataset & Profile Generation Details + +### Datasets +**Status**: NOT pre-downloaded. Uses HuggingFace `datasets` library. 
- Will be downloaded on first use via `load_dataset()`
- Cached locally after the first download
- Some datasets may require authentication (GPQA requires a HuggingFace login)

**To pre-download** (optional):
```python
from datasets import load_dataset

datasets_to_download = [
    ("HuggingFaceH4/MATH-500", None),
    ("lighteval/MATH-Hard", None),
    ("openai/openai_humaneval", None),
    ("bigcode/bigcodebench", None),
    ("Idavidrein/gpqa", "gpqa_diamond"),  # May need HF token
    ("TIGER-Lab/TheoremQA", None),
    # etc.
]
for name, config in datasets_to_download:
    load_dataset(name, config)
```

### Profile Generation
**LLM Used**: `meta-llama/Llama-3.1-70B-Instruct` via `litellm`
- Requires `litellm`: `pip install litellm`
- Needs an API endpoint or a local model setup
- Fallback: the `generate_from_schema()` function creates profiles from a predefined schema (no LLM needed)

**Generation command**:
```bash
# With LLM
python scripts/generate_profiles_v2.py --n-profiles 100 --model meta-llama/Llama-3.1-70B-Instruct

# Without LLM (from schema)
python scripts/generate_profiles_v2.py --n-profiles 100 --from-schema data/preference_schema_v2_sample.json
```

---

## 10. Notes for Future Sessions

- User prefers 100 profiles but is flexible based on compute
- User wants all new challenging datasets included
- User wants LLM-as-judge (not programmatic checks)
- User simulator should be Llama-70B, agent should be Llama-8B
- PDF in `collaborativeagents/` folder describes the original experiments
- User's PersonalizedLLM has an easy-to-use `chat()` function - prefer using it

---

*Last updated: Session where datasets_extended.py and llm_judge.py were created*
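---

**Addendum: toy illustration of the conflict-resolution claim.** The §2 example ("quick question" → concise) can be sketched with a minimal keyword-overlap retriever. This is *not* the project's `retrieval/pipeline.py` (which uses dense retrieval + reranking); the function names and scoring rule here are hypothetical, chosen only to show why top-1 retrieval surfaces a single side of a conflict group while full-context methods see both sides:

```python
# Toy sketch: retrieval-style conflict resolution over conditional preferences.
# Scoring is bag-of-words overlap between the query and each preference's
# condition; the real pipeline uses dense embeddings + reranking instead.

def overlap_score(query: str, condition: str) -> float:
    """Fraction of condition words that also appear in the query."""
    q = set(query.lower().split())
    c = set(condition.lower().split())
    return len(q & c) / max(len(c), 1)

def retrieve_top1(query: str, preferences: list) -> dict:
    """Return the single best-matching preference (what RAG would surface)."""
    return max(preferences, key=lambda p: overlap_score(query, p["condition"]))

preferences = [
    {"condition": "user says quick or is rushed",
     "preference": "be concise",
     "conflict_group": "verbosity_level"},
    {"condition": "complex machine learning topics need depth",
     "preference": "explain in detail",
     "conflict_group": "verbosity_level"},
]

query = "Quick question - how does backpropagation work?"
best = retrieve_top1(query, preferences)
print(best["preference"])  # -> be concise ("quick" matches the first condition)
```

A context-based baseline would instead inject both `verbosity_level` preferences into the prompt and leave the 8B agent to arbitrate, which is exactly the failure mode the conflict queries are designed to expose.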
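---

**Addendum: shape of the benchmark session loop.** For future sessions, a rough sketch of how the adapter (§4) plugs the `chat()` / `apply_feedback()` / `reset_session()` interface into a session of up to 15 turns. The class below is a stand-in stub, not the real `serving/personalized_llm.py`; the enforcement trigger and reward value are invented for illustration:

```python
# Hypothetical glue resembling adapters/personalized_llm_adapter.py.
# StubPersonalizedLLM is a stand-in so the loop runs without the real system.

class StubPersonalizedLLM:
    """Stand-in exposing the same three methods as the real PersonalizedLLM."""
    def __init__(self):
        self.feedback_log = []
    def chat(self, message: str) -> str:
        return f"[response to: {message}]"
    def apply_feedback(self, signal: float) -> None:
        self.feedback_log.append(signal)
    def reset_session(self) -> None:
        self.feedback_log.clear()

def run_session(agent, user_turns, max_turns=15):
    """One simulated session: each enforcement by the user simulator
    is treated as a negative reward signal (invented trigger below)."""
    transcript = []
    for message in user_turns[:max_turns]:
        reply = agent.chat(message)
        transcript.append((message, reply))
        if "wrong format" in message:  # stand-in for simulator enforcement
            agent.apply_feedback(-1.0)
    return transcript

agent = StubPersonalizedLLM()
transcript = run_session(agent, ["Solve x^2 = 4", "wrong format, use bullets"])
print(len(transcript), agent.feedback_log)  # -> 2 [-1.0]
agent.reset_session()  # z_short-style state cleared between sessions
```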