# Internal Project Notes: Personalization Benchmark

> **Purpose**: Internal documentation for AI assistants and developers to understand project state, design decisions, and next steps.

---

## 1. Project Overview

### What This Project Is

A benchmark to evaluate **retrieval-based personalization** (user's main project) against various baselines using a **user simulator framework** (adapted from CollaborativeAgents/MULTISESSIONCOLLAB).

### Core Hypothesis

**Extractor + RAG + User Vector** outperforms all baselines because:

1. **Scalability**: RAG retrieves only relevant preferences (doesn't overflow context)
2. **Conflict Resolution**: The user vector learns which preferences apply in which situations
3. **Avoids Over-personalization**: Doesn't apply irrelevant preferences

### User's Main Project Location

`/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/src/personalization/`

Key components:

- `serving/personalized_llm.py` - Main interface with `chat()`, `apply_feedback()`, `reset_session()`
- `retrieval/pipeline.py` - Dense retrieval + reranking + policy-based selection
- `user_model/tensor_store.py` - Dual vectors: z_long (persistent), z_short (session)
- `feedback/policy/reinforce.py` - REINFORCE updates from inferred rewards

---
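The `chat()` / `apply_feedback()` / `reset_session()` contract can be sketched with a minimal stand-in class; only the three method names are documented above, so the constructor arguments, signatures, and behavior below are illustrative assumptions, not the confirmed API.

```python
# Minimal stand-in sketch of the PersonalizedLLM interface listed above.
# The real class lives in serving/personalized_llm.py; all signatures and
# behavior here are assumptions for illustration only.
class PersonalizedLLMStub:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.session_turns = []   # stands in for z_short (session state)
        self.feedback_log = []    # stands in for the REINFORCE reward buffer

    def chat(self, message: str) -> str:
        # Real system: retrieval + user vector condition the response.
        self.session_turns.append(message)
        return f"[personalized reply for {self.user_id}] {message}"

    def apply_feedback(self, reward: float) -> None:
        # Real system: REINFORCE update from inferred rewards, e.g. a
        # negative signal when the simulated user enforces a preference.
        self.feedback_log.append(reward)

    def reset_session(self) -> None:
        # Real system: clears z_short between sessions, keeps z_long.
        self.session_turns = []


llm = PersonalizedLLMStub(user_id="user_001")
reply = llm.chat("Quick question - how does backpropagation work?")
llm.apply_feedback(reward=-1.0)
llm.reset_session()
```

The benchmark adapter (`adapters/personalized_llm_adapter.py`) only needs this three-call surface, which is why wrapping the real `PersonalizedLLM` for the conversation generator should stay thin.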
## 2. Experiment Design

### Baselines (7 methods)

| Method | Description | Expected Weakness |
|--------|-------------|-------------------|
| `vanilla` | No memory | Forgets everything |
| `contextual` | Full history in context, summarize at limit | Summary loses specific preferences |
| `reflection` | CollaborativeAgents' agent_notes | Notes become unwieldy |
| `reflection_grpo` | Reflection + GRPO training | Better, but no selective retrieval |
| `all_memory` | All extracted memories in context | Over-personalization, conflict confusion |
| `rag` | Extractor + RAG (no user vector) | Good retrieval but random on conflicts |
| `rag_vector` | Extractor + RAG + user vector | **Proposed best method** |

### Datasets

**Existing (from CollaborativeAgents):**

- math-500, math-hard, humaneval, bigcodebench, logiqa, mmlu, medqa

**New challenging datasets (added):**

- `gpqa` - PhD-level science (extremely hard)
- `theoremqa` - Theorem-based math proofs
- `aime` - Competition math (answers 0-999)
- `livecodebench` - Recent competitive programming
- `scicode` - Scientific computing

### Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| Task Success | Did the agent solve the problem? | LLM judge |
| User Effort | User's token count + enforcement count | Automatic |
| Efficiency | Total tokens used | Automatic |
| Conflict Resolution Accuracy | Picked the correct preference in conflicts? | LLM judge |
| Over-personalization Rate | Applied irrelevant preferences? | LLM judge |
| Preference Compliance | Followed applicable preferences? | LLM judge |
### User Profiles

**Key change from the original:**

- Original: 3 flat preferences per user (too simple)
- Ours: **~40 conditional preferences** with explicit conditions

Example:

```json
{
  "condition": "debugging code",
  "preference": "show full stack traces and line numbers",
  "conflict_group": "code_detail"
}
```

**Conflict groups** (15 types):

- format_structure (bullets vs numbered)
- verbosity_level (concise vs detailed)
- code_naming (snake_case vs camelCase by language)
- math_detail (derivations vs final answer)
- guidance_style (incremental vs holistic)
- etc.

### Conflict Testing Design

**Critical for proving the hypothesis.** Every conflict query:

1. Triggers 2+ conflicting preferences
2. Has ONE correct preference based on context
3. RAG naturally retrieves the correct one
4. Context methods see ALL of them and get confused

Example:

```
Query: "Quick question - how does backpropagation work?"

Triggered preferences:
✓ "When I say 'quick', be concise" (matches query signal)
✗ "For complex ML topics, explain in detail" (lower relevance)

RAG retrieves: concise preference (CORRECT)
Context methods see: BOTH → confused → often wrong
```

---

## 3. Technical Architecture

### Models

- **User Simulator**: Llama-3.3-70B-Instruct (powerful, simulates a realistic user)
- **Agent**: Llama-3.1-8B-Instruct (smaller, realistic deployment)
- **LLM Judge**: Llama-3.3-70B-Instruct (evaluates quality)

### Compute Resources (from sinfo)

- `gpu` partition: 4x A100 80GB
- `h100` partition: 4x H100 80GB
- Sufficient for running 70B models

### Session Structure

- 100 user profiles
- 20 sessions per profile
- 15 max turns per session
- 30% of queries are conflict tests

---
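The conflict-resolution mechanism above can be sketched as a toy retrieval step: score each conditional preference against the query and keep only the top hit, so the agent never sees the conflicting alternative. The real pipeline (`retrieval/pipeline.py`) uses dense retrieval plus reranking; the preferences and the lexical Jaccard scoring below are illustrative assumptions only.

```python
import re

# Toy illustration of RAG-style conflict resolution: only the preference
# whose condition best matches the query reaches the agent's context.
# Scoring here is simple lexical overlap (Jaccard), standing in for the
# dense retriever + reranker in the real system.

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, condition: str) -> float:
    q, c = tokens(query), tokens(condition)
    return len(q & c) / len(q | c)  # Jaccard similarity

# Hypothetical conditional preferences from the same conflict group.
preferences = [
    {"condition": "user asks a quick question", "preference": "be concise"},
    {"condition": "user wants complex ML topics explained in depth",
     "preference": "explain in detail"},
]

query = "Quick question - how does backpropagation work?"

# RAG-style selection: keep only the best-matching preference.
best = max(preferences, key=lambda p: score(query, p["condition"]))
print(best["preference"])
```

A context-based baseline, by contrast, would place both preferences in the prompt and leave the conflict for the 8B agent to resolve, which is exactly the failure mode the Conflict Resolution Accuracy metric targets.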
## 4. Files Created

### Core Experiment Framework

| File | Status | Purpose |
|------|--------|---------|
| `scripts/run_experiments.py` | DONE | Main orchestrator - runs all methods, evaluates, generates report |
| `scripts/run_baseline_comparison.py` | DONE | Baseline comparison runner |
| `scripts/conflict_scenario_generator.py` | DONE | Generates 70+ conflict templates |
| `evaluation/llm_judge.py` | DONE | LLM-as-judge for all metrics |
| `evaluation/__init__.py` | DONE | Module exports |

### Profile Generation

| File | Status | Purpose |
|------|--------|---------|
| `data/preference_schema_v2_sample.json` | DONE | Sample schema with 40 preferences, 15 conflict groups |
| `scripts/generate_complex_profiles.py` | DONE | LLM-based batch generation |
| `scripts/generate_profiles_v2.py` | DONE | Profile generation with quality control |

### Dataset & Prompts

| File | Status | Purpose |
|------|--------|---------|
| `datasets_extended.py` | DONE | New challenging datasets (GPQA, AIME, etc.) |
| `prompts_extended.py` | DONE | Step-by-step prompts to make sessions longer |

### Integration

| File | Status | Purpose |
|------|--------|---------|
| `adapters/personalized_llm_adapter.py` | DONE | Wraps PersonalizedLLM for the benchmark |

### Documentation

| File | Status | Purpose |
|------|--------|---------|
| `EXPERIMENT_DESIGN.md` | DONE | Human-readable design doc |
| `INTERNAL_NOTES.md` | DONE | This file - AI/developer notes |

---

## 5. What's TBD (To Be Done)

### High Priority

1. **Generate 100 User Profiles**
   - Run `scripts/generate_profiles_v2.py` with an LLM
   - Needs a SLURM script for the cluster
   - Output: `data/complex_profiles_100.json`
2. **End-to-End Integration Test**
   - Verify the PersonalizedLLM adapter works with the conversation generator
   - Test with 2-3 profiles, 1-2 sessions each
   - Fix any import/path issues
3. **SLURM Scripts for Cluster**
   - Profile generation job
   - Main experiment job (may need multi-GPU)

### Medium Priority
4. **Baseline Adapter Implementations**
   - `vanilla` - simple, no memory
   - `contextual` - needs summarization logic
   - `reflection` - use existing CollaborativeAgents code
   - `reflection_grpo` - needs a trained model
   - `all_memory` - dump all memories into context
5. **User Simulator Integration**
   - Ensure the user agent enforces preferences
   - Ensure the user agent expresses disappointment
   - Both should count as negative signals

### Lower Priority

6. **User Vector Similarity Analysis**
   - Create ground-truth vectors from preference categories
   - Compare learned z_long to ground truth
   - Proves user vectors capture identity
7. **Visualization & Reporting**
   - Token usage graphs over sessions
   - Conflict resolution accuracy by type
   - User effort reduction curves

---

## 6. Key Design Decisions (Reference)

### Why 40 Preferences?

- The original 3 is too simple - any method can handle it
- 40 creates realistic complexity
- Forces context-based methods to overflow or summarize (losing information)
- Creates many conflict opportunities

### Why Conditional Preferences?

- Real users have context-dependent preferences
- "Be concise WHEN I'm rushed" vs "Explain thoroughly WHEN I'm learning"
- RAG naturally handles this via query similarity
- Context methods struggle with conflicting conditions

### Why Step-by-Step Prompts?

- Makes sessions longer (more turns)
- More opportunities for preference expression/violation
- Generates more context (stresses context-based methods)
- More realistic for complex problems

### Why LLM-as-Judge?

- Preference compliance is subjective
- Can't programmatically check "was this concise enough?"
- An LLM judge provides nuanced evaluation
- Using a 70B model for quality

---
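The LLM-as-judge decision can be made concrete with a small prompt-assembly sketch. The real rubric lives in `evaluation/llm_judge.py`; the wording, the 1-5 scale, and the function name below are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch of assembling a judge prompt for the
# preference-compliance metric. The rubric wording and 1-5 scale are
# assumptions; the real prompts are in evaluation/llm_judge.py.

def build_compliance_prompt(preference: str, response: str) -> str:
    return (
        "You are judging whether an assistant followed a user preference.\n"
        f"Preference: {preference}\n"
        f"Assistant response: {response}\n"
        "Answer with a score from 1 (ignored the preference) to 5 "
        "(fully complied), then a one-sentence justification."
    )

prompt = build_compliance_prompt(
    preference="When I say 'quick', be concise",
    response="Backprop applies the chain rule layer by layer to get gradients.",
)
print(prompt)
```

Sending this prompt to the 70B judge and parsing the leading score is what lets subjective metrics like "was this concise enough?" be aggregated alongside the automatic token counts.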
## 7. Code Locations Reference

### User's Personalization System

```
src/personalization/
├── serving/personalized_llm.py      # Main API
├── retrieval/pipeline.py            # RAG pipeline
├── user_model/tensor_store.py       # User vectors
├── feedback/policy/reinforce.py     # Online learning
├── models/preference_extractor/     # Extraction
└── configs/                         # Config files
```

### CollaborativeAgents (Original)

```
collaborativeagents/collaborativeagents/
├── agents/user_agent.py             # User simulator
├── agents/collaborator_agent.py     # Agent interface
├── conversation_generator.py        # Conversation loop
├── conversation_evaluator.py        # Evaluation
├── prompts.py                       # Original prompts
└── datasets/                        # Original datasets
```

### Our Extensions

```
collaborativeagents/
├── scripts/
│   ├── run_experiments.py           # Main entry point
│   ├── generate_profiles_v2.py      # Profile generation
│   └── conflict_scenario_generator.py
├── evaluation/
│   └── llm_judge.py                 # LLM evaluation
├── adapters/
│   └── personalized_llm_adapter.py  # Integration
├── datasets_extended.py             # New datasets
├── prompts_extended.py              # Step-by-step prompts
└── data/
    └── preference_schema_v2_sample.json
```

---

## 8. Quick Start Commands (TBD - update after testing)

```bash
# Generate profiles (needs a SLURM script)
python scripts/generate_profiles_v2.py --n-profiles 100 --output data/profiles.json

# Run experiments
python scripts/run_experiments.py \
    --methods vanilla,contextual,rag,rag_vector \
    --datasets gpqa,aime \
    --n-profiles 100 \
    --n-sessions 20 \
    --output-dir results/

# Quick test (2 profiles, 2 sessions)
python scripts/run_experiments.py \
    --methods rag_vector \
    --datasets math-500 \
    --n-profiles 2 \
    --n-sessions 2 \
    --output-dir results/test/
```

---

## 9. Dataset & Profile Generation Details

### Datasets

**Status**: NOT pre-downloaded. Uses the HuggingFace `datasets` library.
- Will be downloaded on first use via `load_dataset()`
- Cached locally after the first download
- Some datasets may require authentication (GPQA requires a HuggingFace login)

**To pre-download** (optional):

```python
from datasets import load_dataset

datasets_to_download = [
    ("HuggingFaceH4/MATH-500", None),
    ("lighteval/MATH-Hard", None),
    ("openai/openai_humaneval", None),
    ("bigcode/bigcodebench", None),
    ("Idavidrein/gpqa", "gpqa_diamond"),  # May need an HF token
    ("TIGER-Lab/TheoremQA", None),
    # etc.
]

for name, config in datasets_to_download:
    load_dataset(name, config)
```

### Profile Generation

**LLM Used**: `meta-llama/Llama-3.1-70B-Instruct` via `litellm`

- Requires litellm: `pip install litellm`
- Needs an API endpoint or a local model setup
- Fallback: the `generate_from_schema()` function creates profiles from the predefined schema (no LLM needed)

**Generation command**:

```bash
# With LLM
python scripts/generate_profiles_v2.py --n-profiles 100 --model meta-llama/Llama-3.1-70B-Instruct

# Without LLM (from schema)
python scripts/generate_profiles_v2.py --n-profiles 100 --from-schema data/preference_schema_v2_sample.json
```

---

## 10. Notes for Future Sessions

- User prefers 100 profiles but is flexible based on compute
- User wants all new challenging datasets included
- User wants an LLM as judge (not programmatic checks)
- The user simulator LLM should be Llama-70B; the agent should be Llama-8B
- The PDF in the `collaborativeagents/` folder describes the original experiments
- User's PersonalizedLLM has an easy-to-use `chat()` function - prefer using it

---

*Last updated: Session where datasets_extended.py and llm_judge.py were created*