# CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

---

## Project Overview

**Personalization User Model** is a research project on **personalized LLM assistants** that learn and adapt to individual user preferences over multi-session interactions.

### Research Goal

Build an AI assistant that:

1. **Extracts** user preferences from conversation (e.g., "I prefer bullet points", "show me step-by-step math")
2. **Stores** preferences in a retrieval-augmented memory system
3. **Retrieves** relevant preferences using dense retrieval + reranking
4. **Adapts** responses using a learned user vector (REINFORCE-based RL)
5. **Improves** over multiple sessions without explicit user configuration

### Key Innovation

Unlike static preference profiles, this system uses:

- **Online preference extraction** from natural conversation
- **Policy-based memory retrieval** with user-specific vectors
- **REINFORCE updates** from implicit user feedback (enforcement signals)

---

## Repository Structure

```
personalization-user-model/
├── src/personalization/              # Core personalization library
│   ├── serving/                      # Main PersonalizedLLM class
│   │   └── personalized_llm.py       # Primary inference interface
│   ├── models/                       # LLM backends (vLLM, local)
│   │   └── llm/vllm_chat.py          # vLLM HTTP client
│   ├── retrieval/                    # Dense retrieval + reranking
│   ├── feedback/                     # REINFORCE reward processing
│   ├── user_model/                   # User vector learning
│   └── evaluation/                   # Metrics and analysis
│
├── collaborativeagents/              # Experiment framework (MULTISESSIONCOLLAB)
│   ├── adapters/                     # Method adapters for experiments
│   │   ├── personalized_llm_adapter.py  # RAG methods
│   │   ├── contextual_adapter.py        # Full-context baseline
│   │   └── reflection_adapter.py        # CollaborativeAgents baseline
│   ├── agents/                       # Batch processing clients
│   │   └── batch_vllm_agent.py       # Async vLLM/OpenAI batching
│   ├── scripts/                      # Experiment runners
│   │   └── run_experiments.py        # Main experiment script
│   ├── slurm/                        # HPC job scripts
│   │   └── fullscale/                # Full-scale experiment jobs
│   ├── data/                         # User profiles and datasets
│   │   └── complex_profiles_v2/      # 200 user profiles (43 prefs each)
│   └── results/                      # Experiment outputs
│
├── models/                           # Downloaded HuggingFace models (not in git)
│   ├── llama-3.1-8b-instruct/        # Agent LLM
│   ├── qwen3-embedding-8b/           # Dense embeddings
│   └── rerankers/                    # Qwen3 8B reranker
│
├── data/                             # Datasets and corpora (not in git)
│   ├── corpora/                      # Memory card storage
│   └── users/                        # User state persistence
│
└── LLaMA-Factory/                    # External fine-tuning toolkit
```

---

## Methods (Baselines)

The experiment compares **6 methods** for multi-session personalization:

| Method | Description | Memory Type |
|--------|-------------|-------------|
| `vanilla` | No memory, pure LLM | None |
| `contextual` | Full conversation history in context | In-context |
| `reflection` | Session-level reflection → agent_notes | Summarized notes |
| `all_memory` | All extracted preferences in context | All memories |
| `rag` | Dense retrieval + reranking (no user vector) | Retrieved top-k |
| `rag_vector` | RAG + learned user vector (proposed) | Retrieved + personalized |

---

## Setup

### 1. Environment

```bash
cd /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model
source /u/yurenh2/miniforge3/etc/profile.d/conda.sh
conda activate eval
export PYTHONPATH="${PWD}/src:${PWD}/collaborativeagents:${PYTHONPATH}"
export HF_HOME=/projects/bfqt/users/yurenh2/hf_cache/huggingface
```

### 2. Environment Variables

Create a `.env` file with:

```bash
OPENAI_API_KEY=sk-...   # For GPT user simulator
HF_TOKEN=hf_...         # For gated models
```

### 3. Models Required

- **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
- **User Simulator**: 70B model via vLLM, or OpenAI API (gpt-5-mini)
- **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
- **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)

### 4. Storage Locations

| Path | Quota | Usage |
|------|-------|-------|
| `/projects/bfqt` | 500GB (soft) | Code, models, results |
| `/work/hdd/bfqt` | 1TB | Overflow storage, large checkpoints |
| `/work/nvme/bfqt` | 500GB | Fast scratch (temporary) |

---

## Running Experiments

### Quick Test (2 profiles, 2 sessions)

```bash
cd collaborativeagents/scripts
python run_experiments.py \
  --methods vanilla \
  --datasets math-hard \
  --n-profiles 2 \
  --n-sessions 2 \
  --max-turns 8 \
  --use-vllm \
  --vllm-agent-url http://localhost:8003/v1 \
  --parallel-profiles 2 \
  --profile-path ../data/complex_profiles_v2/profiles_200.jsonl \
  --output-dir ../results/test
```

### Full-Scale Experiment

```bash
# GPU Layout (4x A100 80GB):
#   GPU 0-1: 70B user simulator (TP=2)
#   GPU 2:   8B agent
#   GPU 3:   Embedding + Reranker
cd collaborativeagents/slurm/fullscale
sbatch test_local_user.sh
```

### Key Arguments

| Argument | Description |
|----------|-------------|
| `--methods` | Comma-separated: vanilla,contextual,reflection,all_memory,rag,rag_vector |
| `--n-profiles` | Number of user profiles (max 200) |
| `--n-sessions` | Sessions per profile |
| `--max-turns` | Max turns per session |
| `--use-vllm` | Use vLLM for the agent (required for batching) |
| `--use-openai-user` | Use the OpenAI API for the user simulator |
| `--vllm-user-url` | Local vLLM user simulator URL |
| `--parallel-profiles` | Batch size for turn-synchronous processing |
| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |

---

## Current Results

### Completed Experiments

Located in `collaborativeagents/results/`:

| Experiment | Profiles | Sessions | Methods | Status |
|------------|----------|----------|---------|--------|
| `rag_vector_v2_*` | 10 | 10 | rag_vector | Complete |
| `gpt_user_all_methods_*` | 5-10 | 2-5 | all 6 | Partial |
| `test_50parallel_*` | 50 | 1 | vanilla | Test only |

### Throughput Benchmarks

| Setup | Throughput | Notes |
|-------|------------|-------|
| OpenAI user + vLLM agent | ~60 sessions/hr | API latency bottleneck |
| Local 70B user + 8B agent | ~2000+ sessions/hr | Expected (not yet tested) |

---

## Experiments To Be Done

### 1. Full-Scale Benchmark (Priority: HIGH)

**Goal**: 200 profiles × 6 methods × 15 sessions = 18,000 sessions

**Setup**:
- User simulator: local 70B (vLLM, TP=2), NOT OpenAI (too slow)
- Agent: 8B LLaMA (vLLM)
- Reward: LLM judge (GPT-5-nano via API)

**GPU Layout** (4x A100 80GB):

```
GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
GPU 2:   8B agent
GPU 3:   Embedding + Reranker (for RAG methods)
```

**Jobs**: split by method and profile range (50 profiles each)

```
collaborativeagents/slurm/fullscale/
├── run_vanilla_p{0,50,100,150}.sh
├── run_contextual_p{0,50,100,150}.sh
├── run_reflection_p{0,50,100,150}.sh
├── run_all_memory_p{0,50,100,150}.sh
├── run_rag_p{0,50,100,150}.sh
├── run_rag_vector_p{0,50,100,150}.sh
└── submit_all.sh
```

### 2. Session Extension (Priority: MEDIUM)

If 15 sessions prove insufficient, continue to 30 sessions from a checkpoint:

```bash
python run_experiments.py \
  --n-sessions 30 \
  --continue-from ../results/fullscale_15sess/...
```

### 3. Ablation Studies (Priority: LOW)

- RAG with BGE reranker (278M) vs Qwen3 (8B)
- Best-of-N sampling (N=3) for RAG methods
- Different embedding models

---

## Key Files Reference

### Core Personalization

| File | Purpose |
|------|---------|
| `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
| `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
| `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |

### Experiment Framework

| File | Purpose |
|------|---------|
| `collaborativeagents/scripts/run_experiments.py` | Main experiment runner with batch processing |
| `collaborativeagents/adapters/*.py` | Method-specific adapters with `prepare_prompt()`, `process_response()` |
| `collaborativeagents/agents/batch_vllm_agent.py` | `BatchVLLMClient` and `BatchOpenAIClient` for async batching |

### Data

| File | Purpose |
|------|---------|
| `collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl` | 200 user profiles with 43 preferences each |
| `data/corpora/empty_store/` | Empty memory store for fresh experiments |

---

## Troubleshooting

### Quota Exceeded

```bash
# Check quota
quota -s

# Move large files to HDD storage
mv /projects/bfqt/users/yurenh2/large_dir /work/hdd/bfqt/users/yurenh2/
```

### vLLM Server Issues

```bash
# Check if the server is running
curl http://localhost:8003/health

# Kill existing servers
pkill -f "vllm.entrypoints"
```

### Out of GPU Memory

- Reduce `--gpu-memory-utilization` (default 0.90)
- Reduce `--max-model-len` (default 8192)
- Use quantized models (AWQ INT4)

### Slow Throughput

- Use a local vLLM user simulator instead of the OpenAI API
- Increase `--parallel-profiles` for better batching
- Check vLLM logs for "Running: N reqs" to verify batching

---

## Code Conventions

1. **Batch Processing**: all adapters must implement `prepare_prompt()` and `process_response()` for batched vLLM calls
2. **Device Assignment**: GPUs 0-1 for large models, GPU 2 for the agent, GPU 3 for embedding/reranker
3. **Checkpoints**: session-level tracking in `checkpoint.json` with a `sessions_per_profile` dict
4. **Results**: JSON format in `results.json` with per-session metrics

---

## Contact

For questions about this codebase, refer to the experiment plan at:
`/u/yurenh2/.claude/plans/effervescent-mapping-ocean.md`
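The adapter contract noted under Code Conventions (every adapter implements `prepare_prompt()` and `process_response()`) can be sketched minimally. The class name, constructor argument, and OpenAI-style message format below are illustrative assumptions, not the repository's actual interface:

```python
class MinimalAdapter:
    """Illustrative adapter skeleton; real adapters live in collaborativeagents/adapters/."""

    def __init__(self, system_prompt="You are a helpful assistant."):
        self.system_prompt = system_prompt

    def prepare_prompt(self, user_turn, history=None):
        # Build an OpenAI-style message list suitable for a batched vLLM call.
        messages = [{"role": "system", "content": self.system_prompt}]
        for role, content in (history or []):
            messages.append({"role": role, "content": content})
        messages.append({"role": "user", "content": user_turn})
        return messages

    def process_response(self, raw_response):
        # Post-process the model output before it is logged or scored.
        return raw_response.strip()
```

RAG-style adapters would additionally splice retrieved memories into the system prompt inside `prepare_prompt()`.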
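The `--reward-mode keyword` option is described only as a heuristic. A hypothetical version that scores a response by the fraction of preference keywords it honors (the real heuristic in `src/personalization/feedback/` is not shown in this file):

```python
def keyword_reward(response, preference_keywords):
    """Fraction of preference keywords present in the response (illustrative heuristic)."""
    if not preference_keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in preference_keywords if kw.lower() in text)
    return hits / len(preference_keywords)
```
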
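The proposed `rag_vector` method couples retrieval with REINFORCE updates to a learned user vector. One plausible form of such an update, sketched with a softmax-over-dot-products retrieval policy; the actual code in `src/personalization/user_model/` may differ in parameterization, baseline handling, and learning rate:

```python
import numpy as np

def reinforce_update(user_vec, memory_embs, chosen_idx, reward, lr=0.1):
    """One REINFORCE step on the user vector for a softmax retrieval policy.

    user_vec:    (d,) learned user vector
    memory_embs: (n, d) embeddings of candidate memory cards
    chosen_idx:  index of the memory that was retrieved
    reward:      scalar enforcement signal (e.g. +1 if the preference was followed)
    """
    logits = memory_embs @ user_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of log pi(chosen): chosen embedding minus probability-weighted mean embedding.
    grad_log_pi = memory_embs[chosen_idx] - probs @ memory_embs
    return user_vec + lr * reward * grad_log_pi
```

With positive reward the update nudges the user vector toward the retrieved memory's embedding, making that memory more likely to be retrieved again.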
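Session extension (`--continue-from`) relies on the session-level checkpoint described under Code Conventions. A sketch of reading completed-session counts back from `checkpoint.json`; only the `sessions_per_profile` dict is documented here, so the rest of the schema is an assumption:

```python
import json
import os

def completed_sessions(checkpoint_path):
    """Map profile_id -> number of finished sessions, or {} when no checkpoint exists."""
    if not os.path.exists(checkpoint_path):
        return {}
    with open(checkpoint_path) as f:
        ckpt = json.load(f)
    return ckpt.get("sessions_per_profile", {})

def remaining(checkpoint_path, profile_id, target_sessions):
    """Sessions still to run for one profile to reach the target count."""
    done = completed_sessions(checkpoint_path).get(profile_id, 0)
    return max(target_sessions - done, 0)
```
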
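Before submitting jobs, the vLLM endpoints can be probed programmatically, mirroring the `curl http://localhost:8003/health` check from Troubleshooting. This helper assumes only that the server exposes the standard vLLM `/health` route:

```python
import urllib.error
import urllib.request

def server_alive(base_url="http://localhost:8003", timeout=5):
    """Return True if the vLLM server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```
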