From c06ec2f3b80f8968f09eb801b69237495b055ec1 Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Tue, 27 Jan 2026 10:08:01 -0600
Subject: add CLAUDE.md

---
 CLAUDE.md | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 295 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..8819e73
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,295 @@
# CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

---

## Project Overview

**Personalization User Model** is a research project for **personalized LLM assistants** that learn and adapt to individual user preferences over multi-session interactions.

### Research Goal
Build an AI assistant that:
1. **Extracts** user preferences from conversation (e.g., "I prefer bullet points", "show me step-by-step math")
2. **Stores** preferences in a retrieval-augmented memory system
3. **Retrieves** relevant preferences using dense retrieval + reranking
4. **Adapts** responses using a learned user vector (REINFORCE-based RL)
5. **Improves** over multiple sessions without explicit user configuration

### Key Innovation
Unlike static preference profiles, this system uses:
- **Online preference extraction** from natural conversation
- **Policy-based memory retrieval** with user-specific vectors
- **REINFORCE updates** from implicit user feedback (enforcement signals)

---

## Repository Structure

```
personalization-user-model/
├── src/personalization/                 # Core personalization library
│   ├── serving/                         # Main PersonalizedLLM class
│   │   └── personalized_llm.py          # Primary inference interface
│   ├── models/                          # LLM backends (vLLM, local)
│   │   └── llm/vllm_chat.py             # vLLM HTTP client
│   ├── retrieval/                       # Dense retrieval + reranking
│   ├── feedback/                        # REINFORCE reward processing
│   ├── user_model/                      # User vector learning
│   └── evaluation/                      # Metrics and analysis
│
├── collaborativeagents/                 # Experiment framework (MULTISESSIONCOLLAB)
│   ├── adapters/                        # Method adapters for experiments
│   │   ├── personalized_llm_adapter.py  # RAG methods
│   │   ├── contextual_adapter.py        # Full-context baseline
│   │   └── reflection_adapter.py        # CollaborativeAgents baseline
│   ├── agents/                          # Batch processing clients
│   │   └── batch_vllm_agent.py          # Async vLLM/OpenAI batching
│   ├── scripts/                         # Experiment runners
│   │   └── run_experiments.py           # Main experiment script
│   ├── slurm/                           # HPC job scripts
│   │   └── fullscale/                   # Full-scale experiment jobs
│   ├── data/                            # User profiles and datasets
│   │   └── complex_profiles_v2/         # 200 user profiles (43 prefs each)
│   └── results/                         # Experiment outputs
│
├── models/                              # Downloaded HuggingFace models (not in git)
│   ├── llama-3.1-8b-instruct/           # Agent LLM
│   ├── qwen3-embedding-8b/              # Dense embeddings
│   └── rerankers/                       # Qwen3 8B reranker
│
├── data/                                # Datasets and corpora (not in git)
│   ├── corpora/                         # Memory card storage
│   └── users/                           # User state persistence
│
└── LLaMA-Factory/                       # External fine-tuning toolkit
```

---

## Methods (Baselines)

The experiment compares **6 methods** for multi-session personalization:

| Method | Description | Memory Type |
|--------|-------------|-------------|
| `vanilla` | No memory, pure LLM | None |
| `contextual` | Full conversation history in context | In-context |
| `reflection` | Session-level reflection → agent_notes | Summarized notes |
| `all_memory` | All extracted preferences in context | All memories |
| `rag` | Dense retrieval + reranking (no user vector) | Retrieved top-k |
| `rag_vector` | RAG + learned user vector (proposed) | Retrieved + personalized |

---

## Setup

### 1. Environment
```bash
cd /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model
source /u/yurenh2/miniforge3/etc/profile.d/conda.sh
conda activate eval

export PYTHONPATH="${PWD}/src:${PWD}/collaborativeagents:${PYTHONPATH}"
export HF_HOME=/projects/bfqt/users/yurenh2/hf_cache/huggingface
```

### 2. Environment Variables
Create a `.env` file with:
```bash
OPENAI_API_KEY=sk-...   # For GPT user simulator
HF_TOKEN=hf_...         # For gated models
```

### 3. Models Required
- **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
- **User Simulator**: 70B model via vLLM or OpenAI API (gpt-5-mini)
- **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
- **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)

### 4. Storage Locations
| Path | Quota | Usage |
|------|-------|-------|
| `/projects/bfqt` | 500GB (soft) | Code, models, results |
| `/work/hdd/bfqt` | 1TB | Overflow storage, large checkpoints |
| `/work/nvme/bfqt` | 500GB | Fast scratch (temporary) |

---

## Running Experiments

### Quick Test (2 profiles, 2 sessions)
```bash
cd collaborativeagents/scripts
python run_experiments.py \
  --methods vanilla \
  --datasets math-hard \
  --n-profiles 2 \
  --n-sessions 2 \
  --max-turns 8 \
  --use-vllm \
  --vllm-agent-url http://localhost:8003/v1 \
  --parallel-profiles 2 \
  --profile-path ../data/complex_profiles_v2/profiles_200.jsonl \
  --output-dir ../results/test
```

### Full-Scale Experiment
```bash
# GPU Layout (4x A100 80GB):
#   GPU 0-1: 70B user simulator (TP=2)
#   GPU 2:   8B agent
#   GPU 3:   Embedding + Reranker

cd collaborativeagents/slurm/fullscale
sbatch test_local_user.sh
```

### Key Arguments
| Argument | Description |
|----------|-------------|
| `--methods` | Comma-separated: vanilla,contextual,reflection,all_memory,rag,rag_vector |
| `--n-profiles` | Number of user profiles (max 200) |
| `--n-sessions` | Sessions per profile |
| `--max-turns` | Max turns per session |
| `--use-vllm` | Use vLLM for agent (required for batching) |
| `--use-openai-user` | Use OpenAI API for user simulator |
| `--vllm-user-url` | Local vLLM user simulator URL |
| `--parallel-profiles` | Batch size for turn-synchronous processing |
| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |

---

## Current Results

### Completed Experiments
Located in `collaborativeagents/results/`:

| Experiment | Profiles | Sessions | Methods | Status |
|------------|----------|----------|---------|--------|
| `rag_vector_v2_*` | 10 | 10 | rag_vector | Complete |
| `gpt_user_all_methods_*` | 5-10 | 2-5 | all 6 | Partial |
| `test_50parallel_*` | 50 | 1 | vanilla | Test only |

### Throughput Benchmarks
| Setup | Throughput | Notes |
|-------|------------|-------|
| OpenAI user + vLLM agent | ~60 sessions/hr | API latency bottleneck |
| Local 70B user + 8B agent | ~2000+ sessions/hr | Expected (not yet tested) |

---

## Experiments To Be Done

### 1. Full-Scale Benchmark (Priority: HIGH)
**Goal**: 200 profiles × 6 methods × 15 sessions = 18,000 sessions

**Setup**:
- User simulator: Local 70B (vLLM, TP=2) - NOT OpenAI (too slow)
- Agent: 8B LLaMA (vLLM)
- Reward: LLM judge (GPT-5-nano via API)

**GPU Layout** (4x A100 80GB):
```
GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
GPU 2:   8B agent
GPU 3:   Embedding + Reranker (for RAG methods)
```

**Jobs**: Split by method and profile range (50 profiles each)
```
collaborativeagents/slurm/fullscale/
├── run_vanilla_p{0,50,100,150}.sh
├── run_contextual_p{0,50,100,150}.sh
├── run_reflection_p{0,50,100,150}.sh
├── run_all_memory_p{0,50,100,150}.sh
├── run_rag_p{0,50,100,150}.sh
├── run_rag_vector_p{0,50,100,150}.sh
└── submit_all.sh
```

### 2. Session Extension (Priority: MEDIUM)
If 15 sessions are insufficient, continue to 30 sessions using a checkpoint:
```bash
python run_experiments.py \
  --n-sessions 30 \
  --continue-from ../results/fullscale_15sess/...
```

### 3. Ablation Studies (Priority: LOW)
- RAG with BGE reranker (278M) vs Qwen3 (8B)
- Best-of-N sampling (N=3) for RAG methods
- Different embedding models

---

## Key Files Reference

### Core Personalization
| File | Purpose |
|------|---------|
| `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
| `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
| `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |

### Experiment Framework
| File | Purpose |
|------|---------|
| `collaborativeagents/scripts/run_experiments.py` | Main experiment runner with batch processing |
| `collaborativeagents/adapters/*.py` | Method-specific adapters with `prepare_prompt()`, `process_response()` |
| `collaborativeagents/agents/batch_vllm_agent.py` | `BatchVLLMClient` and `BatchOpenAIClient` for async batching |

### Data
| File | Purpose |
|------|---------|
| `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl` | 200 user profiles with 43 preferences each |
| `data/corpora/empty_store/` | Empty memory store for fresh experiments |

---

## Troubleshooting

### Quota Exceeded
```bash
# Check quota
quota -s

# Move large files to HDD storage
mv /projects/bfqt/users/yurenh2/large_dir /work/hdd/bfqt/users/yurenh2/
```

### vLLM Server Issues
```bash
# Check if the server is running
curl http://localhost:8003/health

# Kill existing servers
pkill -f "vllm.entrypoints"
```

### Out of GPU Memory
- Reduce `--gpu-memory-utilization` (default 0.90)
- Reduce `--max-model-len` (default 8192)
- Use quantized models (AWQ INT4)

### Slow Throughput
- Use a local vLLM user simulator instead of the OpenAI API
- Increase `--parallel-profiles` for better batching
- Check vLLM logs for "Running: N reqs" to verify batching
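The batch-processing contract that the Key Files Reference attributes to `collaborativeagents/adapters/*.py` can be sketched as follows. This is a hypothetical minimal adapter assuming only the `prepare_prompt()`/`process_response()` method names mentioned above; the class name, message format, and method bodies are illustrative, not the repository's actual implementation.

```python
# Hypothetical sketch of a method adapter (assumed names; not the repo's code).
# Models the `contextual` baseline: full conversation history stays in context.

class ContextualAdapterSketch:
    """One adapter instance per user profile."""

    def __init__(self, system_prompt: str = "You are a helpful assistant."):
        self.messages = [{"role": "system", "content": system_prompt}]

    def prepare_prompt(self, user_message: str) -> list:
        # Called once per turn. The batch runner collects one message list
        # per active profile and sends them to vLLM in a single async batch.
        self.messages.append({"role": "user", "content": user_message})
        return list(self.messages)

    def process_response(self, response: str) -> None:
        # Record the agent's reply so the next turn sees the full history.
        self.messages.append({"role": "assistant", "content": response})


adapter = ContextualAdapterSketch()
prompt = adapter.prepare_prompt("Please answer in bullet points.")
adapter.process_response("- Sure, noted.")
```

In the turn-synchronous loop this implies, the runner would call `prepare_prompt()` for each active profile, batch the resulting message lists through vLLM (with `--parallel-profiles` controlling batch size), then feed each reply back through `process_response()`.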

---

## Code Conventions

1. **Batch Processing**: All adapters must implement `prepare_prompt()` and `process_response()` for batched vLLM calls
2. **Device Assignment**: GPUs 0-1 for large models, GPU 2 for agent, GPU 3 for embedding/reranker
3. **Checkpoints**: Session-level tracking in `checkpoint.json` with `sessions_per_profile` dict
4. **Results**: JSON format in `results.json` with metrics per session

---

## Contact

For questions about this codebase, refer to the experiment plan at:
`/u/yurenh2/.claude/plans/effervescent-mapping-ocean.md`
-- 
cgit v1.2.3