# MULTISESSIONCOLLAB Experiment Notes

## Experiment Configuration

### Scale
- **Profiles**: 200 users
- **Sessions per profile**: 30
- **Max turns per session**: 15
- **Total sessions**: 6,000 per method
- **Parallel profiles**: 10 (transformers-based methods) / 50 (vLLM-based methods)

### Datasets
- math-hard
- math-500
- bigcodebench

### Methods (7 total)
1. **vanilla** - Direct LLM without personalization (COMPLETED - 97.3% success)
2. **all_memory** - Full conversation history in context
3. **rag** - BM25-based retrieval
4. **rag_vector** - Vector embedding retrieval
5. **contextual** - Context-aware adaptation
6. **reflection** - Self-reflection mechanism
7. **reflection_grpo** - GRPO-trained reflection (requires SFT training completion)
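
The notes don't include the retriever code itself; as a rough illustration of what BM25-based retrieval over past sessions looks like, here is a minimal from-scratch Okapi BM25 sketch (the snippets, tokenization, and parameters are illustrative, not the experiment's actual settings):

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank documents (e.g. past-session snippets) against a query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(t for d in tokenized for t in set(d))  # document frequency per term

    def score(doc):
        tf = Counter(doc)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    # Python's sort is stable, so ties keep their original order.
    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)

history = [
    "user prefers step-by-step algebra explanations",
    "session about python list comprehensions",
    "discussed geometry proofs with the user",
]
print(bm25_rank("algebra step by step", history))  # doc 0 ranks first
```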

---

## GPU Architecture (H200 x 4)

### Method-Specific Configuration

#### contextual / reflection (vLLM-based)
```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: vLLM server (port 8003) - agent inference - 90% memory
Both use tensor-parallel-size 2, LLaMA 3.1 8B
```
- ContextualAdapter and ReflectionAdapter now use VLLMClient (HTTP API)
- Parallelism: 50 profiles (HTTP requests, no local GPU needed)
- Expected throughput: ~3000+ sessions/hr

#### all_memory / rag / rag_vector (transformers-based)
```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: PersonalizedLLMAdapter's transformers models (via CUDA_VISIBLE_DEVICES=2,3)
  - embedding ~8B (Qwen3Embedding8B)
  - reranker ~8B (Qwen3Reranker)
  - chat ~1.5B (Qwen1.5B)
  - extractor ~0.6B (Qwen3_0.6B_SFT)
```
- These methods use PersonalizedLLM, which requires custom model code
- Parallelism: 10 profiles (limited by GPU memory for transformers)
- Expected throughput: ~200-500 sessions/hr (slower due to transformers)
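
The 10-profile cap follows partly from the weight footprint of that model stack. A rough estimate, assuming bf16 weights (2 bytes/parameter) and ignoring KV cache, activations, and framework overhead:

```python
# Approximate bf16 weight footprint of the PersonalizedLLM model stack.
BYTES_PER_PARAM = 2  # bf16
models_b = {"embedding": 8.0, "reranker": 8.0, "chat": 1.5, "extractor": 0.6}

# Billions of params x bytes/param gives GB directly (1e9 params -> 1e9 bytes per byte/param).
total_gb = sum(models_b.values()) * BYTES_PER_PARAM
print(round(total_gb, 1))  # ~36.2 GB of weights before any KV cache or activations
```

Two H200s give roughly 282 GB of HBM, so the remaining headroom after weights is what the per-profile KV caches and activations compete for; that, plus unbatched transformers generation, is what limits concurrency.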

#### vanilla (vLLM batch processing)
```
GPU 0,1: vLLM server (port 8003) - agent inference
GPU 2,3: vLLM server (port 8004) - user simulation
Both at 90% memory utilization
```
- Uses turn-synchronous batching
- Parallelism: 50 conversations batched together
- Expected throughput: ~3000+ sessions/hr
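
Turn-synchronous batching here means all active conversations advance one turn at a time, so each turn's prompts can go to the server as a single batch. A minimal sketch with a stubbed-out generate call (`batch_generate` and the completion criterion stand in for whatever the real client does; all names are illustrative):

```python
def batch_generate(prompts):
    # Stub: the real version would send one batched request to the vLLM server.
    return [f"reply to: {p}" for p in prompts]

def run_turn_synchronous(conversations, max_turns):
    """Advance all active conversations in lockstep, one batched turn at a time."""
    active = list(range(len(conversations)))
    for _ in range(max_turns):
        if not active:
            break
        prompts = [conversations[i][-1] for i in active]
        replies = batch_generate(prompts)
        for i, reply in zip(active, replies):
            conversations[i].append(reply)
        # Drop conversations whose last reply signals completion (stub criterion).
        active = [i for i in active if "done" not in conversations[i][-1]]
    return conversations

convs = [["hi"], ["solve x+1=2"]]
run_turn_synchronous(convs, max_turns=2)
print([len(c) for c in convs])  # each conversation gained one reply per turn
```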

---

## Key Fixes Applied

### 1. FileNotFoundError: configs/local_models.yaml
**Affected**: all_memory, rag, rag_vector (PersonalizedLLMAdapter methods)

**Fix**: Created a symlink
```bash
ln -sf /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/configs/local_models.yaml \
  /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/configs/local_models.yaml
```

### 2. RecursionError / Meta Device Error
**Affected**: contextual, reflection (when they still used transformers-based adapters)

**Cause**: vLLM occupied all 4 GPUs at 90% memory utilization, leaving no memory for the adapter models

**Fix**: Isolate GPUs
- vLLM on GPU 0,1 only (CUDA_VISIBLE_DEVICES=0,1 for the vLLM server)
- Adapter on GPU 2,3 (CUDA_VISIBLE_DEVICES=2,3 for run_experiments.py)
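
Concretely, the isolation looks like the following (model path and ports follow the configs in these notes; treat the exact `vllm serve` invocation as a sketch of one plausible command line, not the job script verbatim):

```shell
# vLLM user-simulation server pinned to GPUs 0,1 only
CUDA_VISIBLE_DEVICES=0,1 vllm serve "$MODEL_8B" \
    --port 8004 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &

# Adapter process pinned to GPUs 2,3; it sees them as cuda:0 and cuda:1
CUDA_VISIBLE_DEVICES=2,3 python run_experiments.py
```

Because CUDA_VISIBLE_DEVICES renumbers devices per process, neither process can touch the other's GPUs, so vLLM's pre-allocation can no longer starve the adapter.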

---

## Job IDs

### Current Experiments (H200 gpuH200x8 partition)
| Job ID | Method | Config | Status |
|--------|--------|--------|--------|
| 14897651 | contextual | vLLM (2 servers) | Pending |
| 14897652 | reflection | vLLM (2 servers) | Pending |
| 14897653 | all_memory | vLLM + transformers | Pending |
| 14897654 | rag | vLLM + transformers | Pending |
| 14897655 | rag_vector | vLLM + transformers | Pending |
| 14814526 | sft_train | Training | Pending |

### Completed
- vanilla: 97.3% task success (Job 14604065)

### Cancelled (old config with transformers-only adapters)
- 14896375-14896379

---

## File Paths

### Models
```
MODEL_8B="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/models/llama-3.1-8b-instruct"
```

### Profiles
```
PROFILE_PATH="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl"
```

### Output
```
--output-dir ../results/full_h200
```

---

## SLURM Settings

```bash
#SBATCH --account=bfqt-delta-gpu
#SBATCH --partition=gpuH200x8
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=24:00:00
```

---

## Fixed: ContextualAdapter and ReflectionAdapter now use vLLM

**Previous issue**: These adapters ran generation through slow transformers inference (~20 sessions/hr).

**Solution**: Modified the adapters to use `VLLMClient` (HTTP API):
- `adapters/contextual_adapter.py` - uses the vLLM server on port 8003
- `adapters/reflection_adapter.py` - uses the vLLM server on port 8003
- Expected speedup: ~150x (from ~20 to ~3000+ sessions/hr)

**Still using transformers** (all_memory, rag, rag_vector):
- PersonalizedLLMAdapter requires custom model code (embedding, reranker, extractor)
- These cannot easily be moved to vLLM without major refactoring
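
The `VLLMClient` implementation isn't shown in these notes, but the shape of the call is just an OpenAI-compatible HTTP request to the server's chat-completions endpoint. A hypothetical sketch (`build_chat_request` and the model name are illustrative; the served model name depends on how the server was launched):

```python
import json
from urllib import request

def build_chat_request(base_url, model, messages, temperature=0.7):
    """Build an OpenAI-compatible chat request for a vLLM server."""
    payload = {"model": model, "messages": messages, "temperature": temperature}
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8003",  # agent-inference server from the config above
    "llama-3.1-8b-instruct",  # illustrative model name
    [{"role": "user", "content": "hello"}],
)
print(req.full_url)
# Sending it with urllib.request.urlopen(req) returns the JSON completion.
```

Because the adapter process only issues HTTP requests like this, it needs no local GPU, which is what allows 50-profile parallelism for the vLLM-backed methods.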

---

## Commands Reference

### Check job status
```bash
squeue -u yurenh2
```

### Check job details
```bash
scontrol show job <JOB_ID>
```

### View output logs
```bash
tail -f /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/exp_<method>-<jobid>.out
```

### Cancel job
```bash
scancel <JOB_ID>
```

---

## Lessons Learned

1. **ALWAYS test on interactive nodes before submitting batch jobs**
2. **Understand GPU memory allocation** - vLLM pre-allocates memory
3. **Check which components actually use which servers** - not all adapters use vLLM
4. **Use CUDA_VISIBLE_DEVICES** to isolate GPU usage between processes
5. **Only vanilla method uses VLLMAgentClient** - other adapters load their own models

---
+Last updated: 2024-12-31