# CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

---

## Project Overview

**Personalization User Model** is a research project on **personalized LLM assistants** that learn and adapt to individual user preferences over multi-session interactions.

### Research Goal

Build an AI assistant that:

1. **Extracts** user preferences from conversation (e.g., "I prefer bullet points", "show me step-by-step math")
2. **Stores** preferences in a retrieval-augmented memory system
3. **Retrieves** relevant preferences using dense retrieval + reranking
4. **Adapts** responses using a learned user vector (REINFORCE-based RL)
5. **Improves** over multiple sessions without explicit user configuration

### Key Innovation

Unlike static preference profiles, this system uses:

- **Online preference extraction** from natural conversation
- **Policy-based memory retrieval** with user-specific vectors
- **REINFORCE updates** from implicit user feedback (enforcement signals)

---

## Repository Structure

```
personalization-user-model/
├── src/personalization/              # Core personalization library
│   ├── serving/                      # Main PersonalizedLLM class
│   │   └── personalized_llm.py       # Primary inference interface
│   ├── models/                       # LLM backends (vLLM, local)
│   │   └── llm/vllm_chat.py          # vLLM HTTP client
│   ├── retrieval/                    # Dense retrieval + reranking
│   ├── feedback/                     # REINFORCE reward processing
│   ├── user_model/                   # User vector learning
│   └── evaluation/                   # Metrics and analysis
│
├── collaborativeagents/              # Experiment framework (MULTISESSIONCOLLAB)
│   ├── adapters/                     # Method adapters for experiments
│   │   ├── personalized_llm_adapter.py  # RAG methods
│   │   ├── contextual_adapter.py        # Full-context baseline
│   │   └── reflection_adapter.py        # CollaborativeAgents baseline
│   ├── agents/                       # Batch processing clients
│   │   └── batch_vllm_agent.py       # Async vLLM/OpenAI batching
│   ├── scripts/                      # Experiment runners
│   │   └── run_experiments.py        # Main experiment script
│   ├── slurm/                        # HPC job scripts
│   │   └── fullscale/                # Full-scale experiment jobs
│   ├── data/                         # User profiles and datasets
│   │   └── complex_profiles_v2/      # 200 user profiles (43 prefs each)
│   └── results/                      # Experiment outputs
│
├── models/                           # Downloaded HuggingFace models (not in git)
│   ├── llama-3.1-8b-instruct/        # Agent LLM
│   ├── qwen3-embedding-8b/           # Dense embeddings
│   └── rerankers/                    # Qwen3 8B reranker
│
├── data/                             # Datasets and corpora (not in git)
│   ├── corpora/                      # Memory card storage
│   └── users/                        # User state persistence
│
└── LLaMA-Factory/                    # External fine-tuning toolkit
```

---

## Methods (Baselines)

The experiment compares **6 methods** for multi-session personalization:

| Method | Description | Memory Type |
|--------|-------------|-------------|
| `vanilla` | No memory, pure LLM | None |
| `contextual` | Full conversation history in context | In-context |
| `reflection` | Session-level reflection → agent_notes | Summarized notes |
| `all_memory` | All extracted preferences in context | All memories |
| `rag` | Dense retrieval + reranking (no user vector) | Retrieved top-k |
| `rag_vector` | RAG + learned user vector (proposed) | Retrieved + personalized |

---

## Setup

### 1. Environment

```bash
cd /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model
source /u/yurenh2/miniforge3/etc/profile.d/conda.sh
conda activate eval
export PYTHONPATH="${PWD}/src:${PWD}/collaborativeagents:${PYTHONPATH}"
export HF_HOME=/projects/bfqt/users/yurenh2/hf_cache/huggingface
```

### 2. Environment Variables

Create a `.env` file with:

```bash
OPENAI_API_KEY=sk-...   # For GPT user simulator
HF_TOKEN=hf_...         # For gated models
```

### 3. Models Required

- **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
- **User Simulator**: 70B model via vLLM, or OpenAI API (gpt-5-mini)
- **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
- **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)

### 4. Storage Locations

| Path | Quota | Usage |
|------|-------|-------|
| `/projects/bfqt` | 500GB (soft) | Code, models, results |
| `/work/hdd/bfqt` | 1TB | Overflow storage, large checkpoints |
| `/work/nvme/bfqt` | 500GB | Fast scratch (temporary) |

---

## Running Experiments

### Quick Test (2 profiles, 2 sessions)

```bash
cd collaborativeagents/scripts
python run_experiments.py \
  --methods vanilla \
  --datasets math-hard \
  --n-profiles 2 \
  --n-sessions 2 \
  --max-turns 8 \
  --use-vllm \
  --vllm-agent-url http://localhost:8003/v1 \
  --parallel-profiles 2 \
  --profile-path ../data/complex_profiles_v2/profiles_200.jsonl \
  --output-dir ../results/test
```

### Full-Scale Experiment

```bash
# GPU Layout (4x A100 80GB):
#   GPU 0-1: 70B user simulator (TP=2)
#   GPU 2:   8B agent
#   GPU 3:   Embedding + Reranker
cd collaborativeagents/slurm/fullscale
sbatch test_local_user.sh
```

### Key Arguments

| Argument | Description |
|----------|-------------|
| `--methods` | Comma-separated: vanilla,contextual,reflection,all_memory,rag,rag_vector |
| `--n-profiles` | Number of user profiles (max 200) |
| `--n-sessions` | Sessions per profile |
| `--max-turns` | Max turns per session |
| `--use-vllm` | Use vLLM for the agent (required for batching) |
| `--use-openai-user` | Use the OpenAI API for the user simulator |
| `--vllm-user-url` | Local vLLM user simulator URL |
| `--parallel-profiles` | Batch size for turn-synchronous processing |
| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |

---

## Current Results

### Completed Experiments

Located in `collaborativeagents/results/`:

| Experiment | Profiles | Sessions | Methods | Status |
|------------|----------|----------|---------|--------|
| `rag_vector_v2_*` | 10 | 10 | rag_vector | Complete |
| `gpt_user_all_methods_*` | 5-10 | 2-5 | all 6 | Partial |
| `test_50parallel_*` | 50 | 1 | vanilla | Test only |

### Throughput Benchmarks

| Setup | Throughput | Notes |
|-------|------------|-------|
| OpenAI user + vLLM agent | ~60 sessions/hr | API latency bottleneck |
| Local 70B user + 8B agent | ~2000+ sessions/hr | Expected (not yet tested) |

---

## Experiments To Be Done

### 1. Full-Scale Benchmark (Priority: HIGH)

**Goal**: 200 profiles × 6 methods × 15 sessions = 18,000 sessions

**Setup**:
- User simulator: local 70B (vLLM, TP=2), NOT OpenAI (too slow)
- Agent: 8B LLaMA (vLLM)
- Reward: LLM judge (GPT-5-nano via API)

**GPU Layout** (4x A100 80GB):

```
GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
GPU 2:   8B agent
GPU 3:   Embedding + Reranker (for RAG methods)
```

**Jobs**: split by method and profile range (50 profiles each)

```
collaborativeagents/slurm/fullscale/
├── run_vanilla_p{0,50,100,150}.sh
├── run_contextual_p{0,50,100,150}.sh
├── run_reflection_p{0,50,100,150}.sh
├── run_all_memory_p{0,50,100,150}.sh
├── run_rag_p{0,50,100,150}.sh
├── run_rag_vector_p{0,50,100,150}.sh
└── submit_all.sh
```

### 2. Session Extension (Priority: MEDIUM)

If 15 sessions prove insufficient, continue to 30 sessions from a checkpoint:

```bash
python run_experiments.py \
  --n-sessions 30 \
  --continue-from ../results/fullscale_15sess/...
```

### 3. Ablation Studies (Priority: LOW)

- RAG with BGE reranker (278M) vs Qwen3 (8B)
- Best-of-N sampling (N=3) for RAG methods
- Different embedding models

---

## Key Files Reference

### Core Personalization

| File | Purpose |
|------|---------|
| `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
| `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
| `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |

### Experiment Framework

| File | Purpose |
|------|---------|
| `collaborativeagents/scripts/run_experiments.py` | Main experiment runner with batch processing |
| `collaborativeagents/adapters/*.py` | Method-specific adapters with `prepare_prompt()`, `process_response()` |
| `collaborativeagents/agents/batch_vllm_agent.py` | `BatchVLLMClient` and `BatchOpenAIClient` for async batching |

### Data

| File | Purpose |
|------|---------|
| `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl` | 200 user profiles with 43 preferences each |
| `data/corpora/empty_store/` | Empty memory store for fresh experiments |

---

## Troubleshooting

### Quota Exceeded

```bash
# Check quota
quota -s

# Move large files to HDD storage
mv /projects/bfqt/users/yurenh2/large_dir /work/hdd/bfqt/users/yurenh2/
```

### vLLM Server Issues

```bash
# Check if the server is running
curl http://localhost:8003/health

# Kill existing servers
pkill -f "vllm.entrypoints"
```

### Out of GPU Memory

- Reduce `--gpu-memory-utilization` (default 0.90)
- Reduce `--max-model-len` (default 8192)
- Use quantized models (AWQ INT4)

### Slow Throughput

- Use a local vLLM user simulator instead of the OpenAI API
- Increase `--parallel-profiles` for better batching
- Check vLLM logs for "Running: N reqs" to verify batching

---

## Code Conventions

1. **Batch Processing**: all adapters must implement `prepare_prompt()` and `process_response()` for batched vLLM calls
2. **Device Assignment**: GPUs 0-1 for large models, GPU 2 for the agent, GPU 3 for embedding/reranker
3. **Checkpoints**: session-level tracking in `checkpoint.json` with a `sessions_per_profile` dict
4. **Results**: JSON format in `results.json` with per-session metrics

---

## Contact

For questions about this codebase, refer to the experiment plan at:
`/u/yurenh2/.claude/plans/effervescent-mapping-ocean.md`
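The adapter contract noted under Code Conventions (every adapter implements `prepare_prompt()` and `process_response()`) can be sketched minimally. The class name, constructor argument, and OpenAI-style message format below are illustrative assumptions, not the repository's actual interface:

```python
class MinimalAdapter:
    """Illustrative adapter skeleton; real adapters live in collaborativeagents/adapters/."""

    def __init__(self, system_prompt="You are a helpful assistant."):
        self.system_prompt = system_prompt

    def prepare_prompt(self, user_turn, history=None):
        # Build an OpenAI-style message list suitable for a batched vLLM call.
        messages = [{"role": "system", "content": self.system_prompt}]
        for role, content in (history or []):
            messages.append({"role": role, "content": content})
        messages.append({"role": "user", "content": user_turn})
        return messages

    def process_response(self, raw_response):
        # Post-process the model output before it is logged or scored.
        return raw_response.strip()
```

RAG-style adapters would additionally splice retrieved memories into the system prompt inside `prepare_prompt()`.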
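The `--reward-mode keyword` option is described only as a heuristic. A hypothetical version that scores a response by the fraction of preference keywords it honors (the real heuristic in `src/personalization/feedback/` is not shown in this file):

```python
def keyword_reward(response, preference_keywords):
    """Fraction of preference keywords present in the response (illustrative heuristic)."""
    if not preference_keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in preference_keywords if kw.lower() in text)
    return hits / len(preference_keywords)
```
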
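The proposed `rag_vector` method couples retrieval with REINFORCE updates to a learned user vector. One plausible form of such an update, sketched with a softmax-over-dot-products retrieval policy; the actual code in `src/personalization/user_model/` may differ in parameterization, baseline handling, and learning rate:

```python
import numpy as np

def reinforce_update(user_vec, memory_embs, chosen_idx, reward, lr=0.1):
    """One REINFORCE step on the user vector for a softmax retrieval policy.

    user_vec:    (d,) learned user vector
    memory_embs: (n, d) embeddings of candidate memory cards
    chosen_idx:  index of the memory that was retrieved
    reward:      scalar enforcement signal (e.g. +1 if the preference was followed)
    """
    logits = memory_embs @ user_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of log pi(chosen): chosen embedding minus probability-weighted mean embedding.
    grad_log_pi = memory_embs[chosen_idx] - probs @ memory_embs
    return user_vec + lr * reward * grad_log_pi
```

With positive reward the update nudges the user vector toward the retrieved memory's embedding, making that memory more likely to be retrieved again.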
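Session extension (`--continue-from`) relies on the session-level checkpoint described under Code Conventions. A sketch of reading completed-session counts back from `checkpoint.json`; only the `sessions_per_profile` dict is documented here, so the rest of the schema is an assumption:

```python
import json
import os

def completed_sessions(checkpoint_path):
    """Map profile_id -> number of finished sessions, or {} when no checkpoint exists."""
    if not os.path.exists(checkpoint_path):
        return {}
    with open(checkpoint_path) as f:
        ckpt = json.load(f)
    return ckpt.get("sessions_per_profile", {})

def remaining(checkpoint_path, profile_id, target_sessions):
    """Sessions still to run for one profile to reach the target count."""
    done = completed_sessions(checkpoint_path).get(profile_id, 0)
    return max(target_sessions - done, 0)
```
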
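Before submitting jobs, the vLLM endpoints can be probed programmatically, mirroring the `curl http://localhost:8003/health` check from Troubleshooting. This helper assumes only that the server exposes the standard vLLM `/health` route:

```python
import urllib.error
import urllib.request

def server_alive(base_url="http://localhost:8003", timeout=5):
    """Return True if the vLLM server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```
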