# Internal Project Notes: Personalization Benchmark

> **Purpose**: Internal documentation for AI assistants and developers to understand project state, design decisions, and next steps.

---

## 1. Project Overview

### What This Project Is

A benchmark to evaluate **retrieval-based personalization** (user's main project) against various baselines using a **user simulator framework** (adapted from CollaborativeAgents/MULTISESSIONCOLLAB).

### Core Hypothesis

**Extractor + RAG + User Vector** outperforms all baselines because:

1. **Scalability**: RAG retrieves only relevant preferences (doesn't overflow context)
2. **Conflict Resolution**: The user vector learns which preferences apply in which situations
3. **Avoids Over-personalization**: Doesn't apply irrelevant preferences

### User's Main Project Location

`/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/src/personalization/`

Key components:

- `serving/personalized_llm.py` - Main interface with `chat()`, `apply_feedback()`, `reset_session()`
- `retrieval/pipeline.py` - Dense retrieval + reranking + policy-based selection
- `user_model/tensor_store.py` - Dual vectors: z_long (persistent), z_short (session)
- `feedback/policy/reinforce.py` - REINFORCE updates from inferred rewards

---
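The `chat()` / `apply_feedback()` / `reset_session()` contract can be sketched with a minimal stand-in class; only the three method names are documented above, so the constructor arguments, signatures, and behavior below are illustrative assumptions, not the confirmed API.

```python
# Minimal stand-in sketch of the PersonalizedLLM interface listed above.
# The real class lives in serving/personalized_llm.py; all signatures and
# behavior here are assumptions for illustration only.
class PersonalizedLLMStub:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.session_turns = []   # stands in for z_short (session state)
        self.feedback_log = []    # stands in for the REINFORCE reward buffer

    def chat(self, message: str) -> str:
        # Real system: retrieval + user vector condition the response.
        self.session_turns.append(message)
        return f"[personalized reply for {self.user_id}] {message}"

    def apply_feedback(self, reward: float) -> None:
        # Real system: REINFORCE update from inferred rewards, e.g. a
        # negative signal when the simulated user enforces a preference.
        self.feedback_log.append(reward)

    def reset_session(self) -> None:
        # Real system: clears z_short between sessions, keeps z_long.
        self.session_turns = []


llm = PersonalizedLLMStub(user_id="user_001")
reply = llm.chat("Quick question - how does backpropagation work?")
llm.apply_feedback(reward=-1.0)
llm.reset_session()
```

The benchmark adapter (`adapters/personalized_llm_adapter.py`) only needs this three-call surface, which is why wrapping the real `PersonalizedLLM` for the conversation generator should stay thin.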
## 2. Experiment Design

### Baselines (7 methods)

| Method | Description | Expected Weakness |
|--------|-------------|-------------------|
| `vanilla` | No memory | Forgets everything |
| `contextual` | Full history in context, summarize at limit | Summary loses specific preferences |
| `reflection` | CollaborativeAgents' agent_notes | Notes become unwieldy |
| `reflection_grpo` | Reflection + GRPO training | Better, but no selective retrieval |
| `all_memory` | All extracted memories in context | Over-personalization, conflict confusion |
| `rag` | Extractor + RAG (no user vector) | Good retrieval but random on conflicts |
| `rag_vector` | Extractor + RAG + user vector | **Proposed best method** |

### Datasets

**Existing (from CollaborativeAgents):**

- math-500, math-hard, humaneval, bigcodebench, logiqa, mmlu, medqa

**New challenging datasets (added):**

- `gpqa` - PhD-level science (extremely hard)
- `theoremqa` - Theorem-based math proofs
- `aime` - Competition math (answers 0-999)
- `livecodebench` - Recent competitive programming
- `scicode` - Scientific computing

### Metrics

| Metric | Description | Source |
|--------|-------------|--------|
| Task Success | Did the agent solve the problem? | LLM judge |
| User Effort | User's token count + enforcement count | Automatic |
| Efficiency | Total tokens used | Automatic |
| Conflict Resolution Accuracy | Picked the correct preference in conflicts? | LLM judge |
| Over-personalization Rate | Applied irrelevant preferences? | LLM judge |
| Preference Compliance | Followed applicable preferences? | LLM judge |
### User Profiles

**Key change from the original:**

- Original: 3 flat preferences per user (too simple)
- Ours: **~40 conditional preferences** with explicit conditions

Example:

```json
{
  "condition": "debugging code",
  "preference": "show full stack traces and line numbers",
  "conflict_group": "code_detail"
}
```

**Conflict groups** (15 types):

- format_structure (bullets vs numbered)
- verbosity_level (concise vs detailed)
- code_naming (snake_case vs camelCase by language)
- math_detail (derivations vs final answer)
- guidance_style (incremental vs holistic)
- etc.

### Conflict Testing Design

**Critical for proving the hypothesis.** Every conflict query:

1. Triggers 2+ conflicting preferences
2. Has ONE correct preference based on context
3. RAG naturally retrieves the correct one
4. Context methods see ALL of them and get confused

Example:

```
Query: "Quick question - how does backpropagation work?"

Triggered preferences:
✓ "When I say 'quick', be concise" (matches query signal)
✗ "For complex ML topics, explain in detail" (lower relevance)

RAG retrieves: concise preference (CORRECT)
Context methods see: BOTH → confused → often wrong
```

---

## 3. Technical Architecture

### Models

- **User Simulator**: Llama-3.3-70B-Instruct (powerful, simulates a realistic user)
- **Agent**: Llama-3.1-8B-Instruct (smaller, realistic deployment)
- **LLM Judge**: Llama-3.3-70B-Instruct (evaluates quality)

### Compute Resources (from sinfo)

- `gpu` partition: 4x A100 80GB
- `h100` partition: 4x H100 80GB
- Sufficient for running 70B models

### Session Structure

- 100 user profiles
- 20 sessions per profile
- 15 max turns per session
- 30% of queries are conflict tests

---
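The conflict-resolution mechanism above can be sketched as a toy retrieval step: score each conditional preference against the query and keep only the top hit, so the agent never sees the conflicting alternative. The real pipeline (`retrieval/pipeline.py`) uses dense retrieval plus reranking; the preferences and the lexical Jaccard scoring below are illustrative assumptions only.

```python
import re

# Toy illustration of RAG-style conflict resolution: only the preference
# whose condition best matches the query reaches the agent's context.
# Scoring here is simple lexical overlap (Jaccard), standing in for the
# dense retriever + reranker in the real system.

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, condition: str) -> float:
    q, c = tokens(query), tokens(condition)
    return len(q & c) / len(q | c)  # Jaccard similarity

# Hypothetical conditional preferences from the same conflict group.
preferences = [
    {"condition": "user asks a quick question", "preference": "be concise"},
    {"condition": "user wants complex ML topics explained in depth",
     "preference": "explain in detail"},
]

query = "Quick question - how does backpropagation work?"

# RAG-style selection: keep only the best-matching preference.
best = max(preferences, key=lambda p: score(query, p["condition"]))
print(best["preference"])
```

A context-based baseline, by contrast, would place both preferences in the prompt and leave the conflict for the 8B agent to resolve, which is exactly the failure mode the Conflict Resolution Accuracy metric targets.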
## 4. Files Created

### Core Experiment Framework

| File | Status | Purpose |
|------|--------|---------|
| `scripts/run_experiments.py` | DONE | Main orchestrator - runs all methods, evaluates, generates report |
| `scripts/run_baseline_comparison.py` | DONE | Baseline comparison runner |
| `scripts/conflict_scenario_generator.py` | DONE | Generates 70+ conflict templates |
| `evaluation/llm_judge.py` | DONE | LLM-as-judge for all metrics |
| `evaluation/__init__.py` | DONE | Module exports |

### Profile Generation

| File | Status | Purpose |
|------|--------|---------|
| `data/preference_schema_v2_sample.json` | DONE | Sample schema with 40 preferences, 15 conflict groups |
| `scripts/generate_complex_profiles.py` | DONE | LLM-based batch generation |
| `scripts/generate_profiles_v2.py` | DONE | Profile generation with quality control |

### Dataset & Prompts

| File | Status | Purpose |
|------|--------|---------|
| `datasets_extended.py` | DONE | New challenging datasets (GPQA, AIME, etc.) |
| `prompts_extended.py` | DONE | Step-by-step prompts to make sessions longer |

### Integration

| File | Status | Purpose |
|------|--------|---------|
| `adapters/personalized_llm_adapter.py` | DONE | Wraps PersonalizedLLM for the benchmark |

### Documentation

| File | Status | Purpose |
|------|--------|---------|
| `EXPERIMENT_DESIGN.md` | DONE | Human-readable design doc |
| `INTERNAL_NOTES.md` | DONE | This file - AI/developer notes |

---

## 5. What's TBD (To Be Done)

### High Priority

1. **Generate 100 User Profiles**
   - Run `scripts/generate_profiles_v2.py` with an LLM
   - Needs a SLURM script for the cluster
   - Output: `data/complex_profiles_100.json`
2. **End-to-End Integration Test**
   - Verify the PersonalizedLLM adapter works with the conversation generator
   - Test with 2-3 profiles, 1-2 sessions each
   - Fix any import/path issues
3. **SLURM Scripts for Cluster**
   - Profile generation job
   - Main experiment job (may need multi-GPU)

### Medium Priority
4. **Baseline Adapter Implementations**
   - `vanilla` - simple, no memory
   - `contextual` - needs summarization logic
   - `reflection` - use existing CollaborativeAgents code
   - `reflection_grpo` - needs a trained model
   - `all_memory` - dump all memories into context
5. **User Simulator Integration**
   - Ensure the user agent enforces preferences
   - Ensure the user agent expresses disappointment
   - Both should count as negative signals

### Lower Priority

6. **User Vector Similarity Analysis**
   - Create ground-truth vectors from preference categories
   - Compare learned z_long to ground truth
   - Proves user vectors capture identity
7. **Visualization & Reporting**
   - Token usage graphs over sessions
   - Conflict resolution accuracy by type
   - User effort reduction curves

---

## 6. Key Design Decisions (Reference)

### Why 40 Preferences?

- The original 3 is too simple - any method can handle it
- 40 creates realistic complexity
- Forces context-based methods to overflow or summarize (losing information)
- Creates many conflict opportunities

### Why Conditional Preferences?

- Real users have context-dependent preferences
- "Be concise WHEN I'm rushed" vs "Explain thoroughly WHEN I'm learning"
- RAG naturally handles this via query similarity
- Context methods struggle with conflicting conditions

### Why Step-by-Step Prompts?

- Makes sessions longer (more turns)
- More opportunities for preference expression/violation
- Generates more context (stresses context-based methods)
- More realistic for complex problems

### Why LLM-as-Judge?

- Preference compliance is subjective
- Can't programmatically check "was this concise enough?"
- An LLM judge provides nuanced evaluation
- Using a 70B model for quality

---
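The LLM-as-judge decision can be made concrete with a small prompt-assembly sketch. The real rubric lives in `evaluation/llm_judge.py`; the wording, the 1-5 scale, and the function name below are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch of assembling a judge prompt for the
# preference-compliance metric. The rubric wording and 1-5 scale are
# assumptions; the real prompts are in evaluation/llm_judge.py.

def build_compliance_prompt(preference: str, response: str) -> str:
    return (
        "You are judging whether an assistant followed a user preference.\n"
        f"Preference: {preference}\n"
        f"Assistant response: {response}\n"
        "Answer with a score from 1 (ignored the preference) to 5 "
        "(fully complied), then a one-sentence justification."
    )

prompt = build_compliance_prompt(
    preference="When I say 'quick', be concise",
    response="Backprop applies the chain rule layer by layer to get gradients.",
)
print(prompt)
```

Sending this prompt to the 70B judge and parsing the leading score is what lets subjective metrics like "was this concise enough?" be aggregated alongside the automatic token counts.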
## 7. Code Locations Reference

### User's Personalization System

```
src/personalization/
├── serving/personalized_llm.py      # Main API
├── retrieval/pipeline.py            # RAG pipeline
├── user_model/tensor_store.py       # User vectors
├── feedback/policy/reinforce.py     # Online learning
├── models/preference_extractor/     # Extraction
└── configs/                         # Config files
```

### CollaborativeAgents (Original)

```
collaborativeagents/collaborativeagents/
├── agents/user_agent.py             # User simulator
├── agents/collaborator_agent.py     # Agent interface
├── conversation_generator.py        # Conversation loop
├── conversation_evaluator.py        # Evaluation
├── prompts.py                       # Original prompts
└── datasets/                        # Original datasets
```

### Our Extensions

```
collaborativeagents/
├── scripts/
│   ├── run_experiments.py           # Main entry point
│   ├── generate_profiles_v2.py      # Profile generation
│   └── conflict_scenario_generator.py
├── evaluation/
│   └── llm_judge.py                 # LLM evaluation
├── adapters/
│   └── personalized_llm_adapter.py  # Integration
├── datasets_extended.py             # New datasets
├── prompts_extended.py              # Step-by-step prompts
└── data/
    └── preference_schema_v2_sample.json
```

---

## 8. Quick Start Commands (TBD - update after testing)

```bash
# Generate profiles (needs a SLURM script)
python scripts/generate_profiles_v2.py --n-profiles 100 --output data/profiles.json

# Run experiments
python scripts/run_experiments.py \
    --methods vanilla,contextual,rag,rag_vector \
    --datasets gpqa,aime \
    --n-profiles 100 \
    --n-sessions 20 \
    --output-dir results/

# Quick test (2 profiles, 2 sessions)
python scripts/run_experiments.py \
    --methods rag_vector \
    --datasets math-500 \
    --n-profiles 2 \
    --n-sessions 2 \
    --output-dir results/test/
```

---

## 9. Dataset & Profile Generation Details

### Datasets

**Status**: NOT pre-downloaded. Uses the HuggingFace `datasets` library.
- Will be downloaded on first use via `load_dataset()`
- Cached locally after the first download
- Some datasets may require authentication (GPQA requires a HuggingFace login)

**To pre-download** (optional):

```python
from datasets import load_dataset

datasets_to_download = [
    ("HuggingFaceH4/MATH-500", None),
    ("lighteval/MATH-Hard", None),
    ("openai/openai_humaneval", None),
    ("bigcode/bigcodebench", None),
    ("Idavidrein/gpqa", "gpqa_diamond"),  # May need an HF token
    ("TIGER-Lab/TheoremQA", None),
    # etc.
]

for name, config in datasets_to_download:
    load_dataset(name, config)
```

### Profile Generation

**LLM Used**: `meta-llama/Llama-3.1-70B-Instruct` via `litellm`

- Requires litellm: `pip install litellm`
- Needs an API endpoint or a local model setup
- Fallback: the `generate_from_schema()` function creates profiles from the predefined schema (no LLM needed)

**Generation command**:

```bash
# With LLM
python scripts/generate_profiles_v2.py --n-profiles 100 --model meta-llama/Llama-3.1-70B-Instruct

# Without LLM (from schema)
python scripts/generate_profiles_v2.py --n-profiles 100 --from-schema data/preference_schema_v2_sample.json
```

---

## 10. Notes for Future Sessions

- User prefers 100 profiles but is flexible based on compute
- User wants all new challenging datasets included
- User wants an LLM as judge (not programmatic checks)
- The user simulator LLM should be Llama-70B; the agent should be Llama-8B
- The PDF in the `collaborativeagents/` folder describes the original experiments
- User's PersonalizedLLM has an easy-to-use `chat()` function - prefer using it

---

*Last updated: Session where datasets_extended.py and llm_judge.py were created*