# Internal Project Notes: Personalization Benchmark

> **Purpose**: Internal documentation for AI assistants and developers to understand project state, design decisions, and next steps.

---

## 1. Project Overview

### What This Project Is
A benchmark to evaluate **retrieval-based personalization** (user's main project) against various baselines using a **user simulator framework** (adapted from CollaborativeAgents/MULTISESSIONCOLLAB).

### Core Hypothesis
**Extractor + RAG + User Vector** outperforms all baselines because:
1. **Scalability**: RAG retrieves only relevant preferences (doesn't overflow context)
2. **Conflict Resolution**: User vector learns which preferences apply in which situations
3. **Avoids Over-personalization**: Doesn't apply irrelevant preferences

### User's Main Project Location
`/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/src/personalization/`

Key components:
- `serving/personalized_llm.py` - Main interface with `chat()`, `apply_feedback()`, `reset_session()`
- `retrieval/pipeline.py` - Dense retrieval + reranking + policy-based selection
- `user_model/tensor_store.py` - Dual vectors: z_long (persistent), z_short (session)
- `feedback/policy/reinforce.py` - REINFORCE updates from inferred rewards

---

## 2. 
Experiment Design + +### Baselines (7 methods) + +| Method | Description | Expected Weakness | +|--------|-------------|-------------------| +| `vanilla` | No memory | Forgets everything | +| `contextual` | Full history in context, summarize at limit | Summary loses specific preferences | +| `reflection` | CollaborativeAgents' agent_notes | Notes become unwieldy | +| `reflection_grpo` | Reflection + GRPO training | Better but no selective retrieval | +| `all_memory` | All extracted memories in context | Over-personalization, conflict confusion | +| `rag` | Extractor + RAG (no user vector) | Good retrieval but random on conflicts | +| `rag_vector` | Extractor + RAG + user vector | **Proposed best method** | + +### Datasets + +**Existing (from CollaborativeAgents):** +- math-500, math-hard, humaneval, bigcodebench, logiqa, mmlu, medqa + +**New challenging datasets (added):** +- `gpqa` - PhD-level science (extremely hard) +- `theoremqa` - Theorem-based math proofs +- `aime` - Competition math (answers 0-999) +- `livecodebench` - Recent competitive programming +- `scicode` - Scientific computing + +### Metrics + +| Metric | Description | Source | +|--------|-------------|--------| +| Task Success | Did agent solve the problem? | LLM judge | +| User Effort | User's token count + enforcement count | Automatic | +| Efficiency | Total tokens used | Automatic | +| Conflict Resolution Accuracy | Picked correct preference in conflicts? | LLM judge | +| Over-personalization Rate | Applied irrelevant preferences? | LLM judge | +| Preference Compliance | Followed applicable preferences? 
| LLM judge | + +### User Profiles + +**Key change from original:** +- Original: 3 flat preferences per user (too simple) +- Ours: **~40 conditional preferences** with explicit conditions + +Example: +```json +{ + "condition": "debugging code", + "preference": "show full stack traces and line numbers", + "conflict_group": "code_detail" +} +``` + +**Conflict groups** (15 types): +- format_structure (bullets vs numbered) +- verbosity_level (concise vs detailed) +- code_naming (snake_case vs camelCase by language) +- math_detail (derivations vs final answer) +- guidance_style (incremental vs holistic) +- etc. + +### Conflict Testing Design + +**Critical for proving hypothesis.** Every conflict query: +1. Triggers 2+ conflicting preferences +2. Has ONE correct preference based on context +3. RAG naturally retrieves the correct one +4. Context methods see ALL and get confused + +Example: +``` +Query: "Quick question - how does backpropagation work?" + +Triggered preferences: + ✓ "When I say 'quick', be concise" (matches query signal) + ✗ "For complex ML topics, explain in detail" (lower relevance) + +RAG retrieves: concise preference (CORRECT) +Context methods see: BOTH → confused → often wrong +``` + +--- + +## 3. Technical Architecture + +### Models +- **User Simulator**: Llama-3.3-70B-Instruct (powerful, simulates realistic user) +- **Agent**: Llama-3.1-8B-Instruct (smaller, realistic deployment) +- **LLM Judge**: Llama-3.3-70B-Instruct (evaluates quality) + +### Compute Resources (from sinfo) +- `gpu` partition: 4x A100 80GB +- `h100` partition: 4x H100 80GB +- Good for running 70B models + +### Session Structure +- 100 user profiles +- 20 sessions per profile +- 15 max turns per session +- 30% of queries are conflict tests + +--- + +## 4. 
Files Created + +### Core Experiment Framework + +| File | Status | Purpose | +|------|--------|---------| +| `scripts/run_experiments.py` | DONE | Main orchestrator - runs all methods, evaluates, generates report | +| `scripts/run_baseline_comparison.py` | DONE | Baseline comparison runner | +| `scripts/conflict_scenario_generator.py` | DONE | Generates 70+ conflict templates | +| `evaluation/llm_judge.py` | DONE | LLM-as-judge for all metrics | +| `evaluation/__init__.py` | DONE | Module exports | + +### Profile Generation + +| File | Status | Purpose | +|------|--------|---------| +| `data/preference_schema_v2_sample.json` | DONE | Sample schema with 40 preferences, 15 conflict groups | +| `scripts/generate_complex_profiles.py` | DONE | LLM-based batch generation | +| `scripts/generate_profiles_v2.py` | DONE | Profile generation with quality control | + +### Dataset & Prompts + +| File | Status | Purpose | +|------|--------|---------| +| `datasets_extended.py` | DONE | New challenging datasets (GPQA, AIME, etc.) | +| `prompts_extended.py` | DONE | Step-by-step prompts to make sessions longer | + +### Integration + +| File | Status | Purpose | +|------|--------|---------| +| `adapters/personalized_llm_adapter.py` | DONE | Wraps PersonalizedLLM for benchmark | + +### Documentation + +| File | Status | Purpose | +|------|--------|---------| +| `EXPERIMENT_DESIGN.md` | DONE | Human-readable design doc | +| `INTERNAL_NOTES.md` | DONE | This file - AI/developer notes | + +--- + +## 5. What's TBD (To Be Done) + +### High Priority + +1. **Generate 100 User Profiles** + - Run `scripts/generate_profiles_v2.py` with LLM + - Need SLURM script for cluster + - Output: `data/complex_profiles_100.json` + +2. **End-to-End Integration Test** + - Verify PersonalizedLLM adapter works with conversation generator + - Test with 2-3 profiles, 1-2 sessions each + - Fix any import/path issues + +3. 
**SLURM Scripts for Cluster** + - Profile generation job + - Main experiment job (may need multi-GPU) + +### Medium Priority + +4. **Baseline Adapter Implementations** + - `vanilla` - simple, no memory + - `contextual` - needs summarization logic + - `reflection` - use existing CollaborativeAgents code + - `reflection_grpo` - need trained model + - `all_memory` - dump all memories in context + +5. **User Simulator Integration** + - Ensure user agent enforces preferences + - Ensure user agent expresses disappointment + - Both should count as negative signals + +### Lower Priority + +6. **User Vector Similarity Analysis** + - Create ground truth vectors from preference categories + - Compare learned z_long to ground truth + - Proves user vectors capture identity + +7. **Visualization & Reporting** + - Token usage graphs over sessions + - Conflict resolution accuracy by type + - User effort reduction curves + +--- + +## 6. Key Design Decisions (Reference) + +### Why 40 Preferences? +- Original 3 is too simple - any method can handle it +- 40 creates realistic complexity +- Forces context-based methods to overflow or summarize (losing info) +- Creates many conflict opportunities + +### Why Conditional Preferences? +- Real users have context-dependent preferences +- "Be concise WHEN I'm rushed" vs "Explain thoroughly WHEN I'm learning" +- RAG naturally handles this via query similarity +- Context methods struggle with conflicting conditions + +### Why Step-by-Step Prompts? +- Makes sessions longer (more turns) +- More opportunities for preference expression/violation +- Generates more context (stresses context-based methods) +- More realistic for complex problems + +### Why LLM-as-Judge? +- Preference compliance is subjective +- Can't programmatically check "was this concise enough?" +- LLM judge provides nuanced evaluation +- Using 70B model for quality + +--- + +## 7. 
Code Locations Reference + +### User's Personalization System +``` +src/personalization/ +├── serving/personalized_llm.py # Main API +├── retrieval/pipeline.py # RAG pipeline +├── user_model/tensor_store.py # User vectors +├── feedback/policy/reinforce.py # Online learning +├── models/preference_extractor/ # Extraction +└── configs/ # Config files +``` + +### CollaborativeAgents (Original) +``` +collaborativeagents/collaborativeagents/ +├── agents/user_agent.py # User simulator +├── agents/collaborator_agent.py # Agent interface +├── conversation_generator.py # Conversation loop +├── conversation_evaluator.py # Evaluation +├── prompts.py # Original prompts +└── datasets/ # Original datasets +``` + +### Our Extensions +``` +collaborativeagents/ +├── scripts/ +│ ├── run_experiments.py # Main entry point +│ ├── generate_profiles_v2.py # Profile generation +│ └── conflict_scenario_generator.py +├── evaluation/ +│ └── llm_judge.py # LLM evaluation +├── adapters/ +│ └── personalized_llm_adapter.py # Integration +├── datasets_extended.py # New datasets +├── prompts_extended.py # Step-by-step prompts +└── data/ + └── preference_schema_v2_sample.json +``` + +--- + +## 8. Quick Start Commands (TBD - update after testing) + +```bash +# Generate profiles (need SLURM script) +python scripts/generate_profiles_v2.py --n-profiles 100 --output data/profiles.json + +# Run experiments +python scripts/run_experiments.py \ + --methods vanilla,contextual,rag,rag_vector \ + --datasets gpqa,aime \ + --n-profiles 100 \ + --n-sessions 20 \ + --output-dir results/ + +# Quick test (2 profiles, 2 sessions) +python scripts/run_experiments.py \ + --methods rag_vector \ + --datasets math-500 \ + --n-profiles 2 \ + --n-sessions 2 \ + --output-dir results/test/ +``` + +--- + +## 9. Dataset & Profile Generation Details + +### Datasets +**Status**: NOT pre-downloaded. Uses HuggingFace `datasets` library. 
- Will be downloaded on first use via `load_dataset()`
- Cached locally after the first download
- Some datasets may require authentication (GPQA requires a HuggingFace login)

**To pre-download** (optional):
```python
from datasets import load_dataset

datasets_to_download = [
    ("HuggingFaceH4/MATH-500", None),
    ("lighteval/MATH-Hard", None),
    ("openai/openai_humaneval", None),
    ("bigcode/bigcodebench", None),
    ("Idavidrein/gpqa", "gpqa_diamond"),  # May need HF token
    ("TIGER-Lab/TheoremQA", None),
    # etc.
]
for name, config in datasets_to_download:
    load_dataset(name, config)
```

### Profile Generation
**LLM Used**: `meta-llama/Llama-3.1-70B-Instruct` via `litellm`
- Requires `litellm`: `pip install litellm`
- Needs an API endpoint or a local model setup
- Fallback: the `generate_from_schema()` function creates profiles from a predefined schema (no LLM needed)

**Generation command**:
```bash
# With LLM
python scripts/generate_profiles_v2.py --n-profiles 100 --model meta-llama/Llama-3.1-70B-Instruct

# Without LLM (from schema)
python scripts/generate_profiles_v2.py --n-profiles 100 --from-schema data/preference_schema_v2_sample.json
```

---

## 10. Notes for Future Sessions

- User prefers 100 profiles but is flexible based on compute
- User wants all new challenging datasets included
- User wants LLM-as-judge (not programmatic checks)
- User simulator should be Llama-70B, agent should be Llama-8B
- PDF in `collaborativeagents/` folder describes the original experiments
- User's PersonalizedLLM has an easy-to-use `chat()` function - prefer using it

---

*Last updated: Session where datasets_extended.py and llm_judge.py were created*
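---

**Addendum: toy illustration of the conflict-resolution claim.** The §2 example ("quick question" → concise) can be sketched with a minimal keyword-overlap retriever. This is *not* the project's `retrieval/pipeline.py` (which uses dense retrieval + reranking); the function names and scoring rule here are hypothetical, chosen only to show why top-1 retrieval surfaces a single side of a conflict group while full-context methods see both sides:

```python
# Toy sketch: retrieval-style conflict resolution over conditional preferences.
# Scoring is bag-of-words overlap between the query and each preference's
# condition; the real pipeline uses dense embeddings + reranking instead.

def overlap_score(query: str, condition: str) -> float:
    """Fraction of condition words that also appear in the query."""
    q = set(query.lower().split())
    c = set(condition.lower().split())
    return len(q & c) / max(len(c), 1)

def retrieve_top1(query: str, preferences: list) -> dict:
    """Return the single best-matching preference (what RAG would surface)."""
    return max(preferences, key=lambda p: overlap_score(query, p["condition"]))

preferences = [
    {"condition": "user says quick or is rushed",
     "preference": "be concise",
     "conflict_group": "verbosity_level"},
    {"condition": "complex machine learning topics need depth",
     "preference": "explain in detail",
     "conflict_group": "verbosity_level"},
]

query = "Quick question - how does backpropagation work?"
best = retrieve_top1(query, preferences)
print(best["preference"])  # -> be concise ("quick" matches the first condition)
```

A context-based baseline would instead inject both `verbosity_level` preferences into the prompt and leave the 8B agent to arbitrate, which is exactly the failure mode the conflict queries are designed to expose.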
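---

**Addendum: shape of the benchmark session loop.** For future sessions, a rough sketch of how the adapter (§4) plugs the `chat()` / `apply_feedback()` / `reset_session()` interface into a session of up to 15 turns. The class below is a stand-in stub, not the real `serving/personalized_llm.py`; the enforcement trigger and reward value are invented for illustration:

```python
# Hypothetical glue resembling adapters/personalized_llm_adapter.py.
# StubPersonalizedLLM is a stand-in so the loop runs without the real system.

class StubPersonalizedLLM:
    """Stand-in exposing the same three methods as the real PersonalizedLLM."""
    def __init__(self):
        self.feedback_log = []
    def chat(self, message: str) -> str:
        return f"[response to: {message}]"
    def apply_feedback(self, signal: float) -> None:
        self.feedback_log.append(signal)
    def reset_session(self) -> None:
        self.feedback_log.clear()

def run_session(agent, user_turns, max_turns=15):
    """One simulated session: each enforcement by the user simulator
    is treated as a negative reward signal (invented trigger below)."""
    transcript = []
    for message in user_turns[:max_turns]:
        reply = agent.chat(message)
        transcript.append((message, reply))
        if "wrong format" in message:  # stand-in for simulator enforcement
            agent.apply_feedback(-1.0)
    return transcript

agent = StubPersonalizedLLM()
transcript = run_session(agent, ["Solve x^2 = 4", "wrong format, use bullets"])
print(len(transcript), agent.feedback_log)  # -> 2 [-1.0]
agent.reset_session()  # z_short-style state cleared between sessions
```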