# MULTISESSIONCOLLAB Experiment Notes

## Experiment Configuration

### Scale

- **Profiles**: 200 users
- **Sessions per profile**: 30
- **Max turns per session**: 15
- **Total sessions**: 6,000 per method
- **Parallel profiles**: 10

### Datasets

- math-hard
- math-500
- bigcodebench

### Methods (7 total)

1. **vanilla** - Direct LLM without personalization (COMPLETED - 97.3% success)
2. **all_memory** - Full conversation history in context
3. **rag** - BM25-based retrieval
4. **rag_vector** - Vector embedding retrieval
5. **contextual** - Context-aware adaptation
6. **reflection** - Self-reflection mechanism
7. **reflection_grpo** - GRPO-trained reflection (requires SFT training completion)

---

## GPU Architecture (H200 x 4)

### Method-Specific Configuration

#### contextual / reflection (vLLM-based)

```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: vLLM server (port 8003) - agent inference - 90% memory
Both use tensor-parallel-size 2, LLaMA 3.1 8B
```

- ContextualAdapter and ReflectionAdapter now use VLLMClient (HTTP API)
- Parallelism: 50 profiles (HTTP requests, no local GPU needed)
- Expected throughput: ~3000+ sessions/hr

#### all_memory / rag / rag_vector (transformers-based)

```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: PersonalizedLLMAdapter's transformers models (via CUDA_VISIBLE_DEVICES=2,3)
  - embedding ~8B (Qwen3Embedding8B)
  - reranker ~8B (Qwen3Reranker)
  - chat ~1.5B (Qwen1.5B)
  - extractor ~0.6B (Qwen3_0.6B_SFT)
```

- These methods use PersonalizedLLM, which requires custom model code
- Parallelism: 10 profiles (limited by GPU memory for transformers)
- Expected throughput: ~200-500 sessions/hr (slower due to transformers)

#### vanilla (vLLM batch processing)

```
GPU 0,1: vLLM server (port 8003) - agent inference
GPU 2,3: vLLM server (port 8004) - user simulation
Both at 90% memory utilization
```

- Uses turn-synchronous batch processing
- Parallelism: 50 conversations batched together
- Expected throughput: ~3000+ sessions/hr

---

## Key Fixes Applied

### 1. FileNotFoundError: configs/local_models.yaml

**Affected**: all_memory, rag, rag_vector (PersonalizedLLMAdapter methods)

**Fix**: Created symlink

```bash
ln -sf /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/configs/local_models.yaml \
  /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/configs/local_models.yaml
```

### 2. RecursionError / Meta Device Error

**Affected**: contextual, reflection (transformers-based adapters)

**Cause**: vLLM was using all 4 GPUs at 90% memory utilization, leaving no memory for the adapter

**Fix**: Isolate GPUs

- vLLM on GPU 0,1 only (`CUDA_VISIBLE_DEVICES=0,1` for the vLLM server)
- Adapter on GPU 2,3 (`CUDA_VISIBLE_DEVICES=2,3` for run_experiments.py)

---

## Job IDs

### Current Experiments (H200 gpuH200x8 partition)

| Job ID | Method | Config | Status |
|--------|--------|--------|--------|
| 14897651 | contextual | vLLM (2 servers) | Pending |
| 14897652 | reflection | vLLM (2 servers) | Pending |
| 14897653 | all_memory | vLLM + transformers | Pending |
| 14897654 | rag | vLLM + transformers | Pending |
| 14897655 | rag_vector | vLLM + transformers | Pending |
| 14814526 | sft_train | Training | Pending |

### Completed

- vanilla: 97.3% task success (Job 14604065)

### Cancelled (old config with transformers-only adapters)

- 14896375-14896379

---

## File Paths

### Models

```
MODEL_8B="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/models/llama-3.1-8b-instruct"
```

### Profiles

```
PROFILE_PATH="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl"
```

### Output

```
--output-dir ../results/full_h200
```

---

## SLURM Settings

```bash
#SBATCH --account=bfqt-delta-gpu
#SBATCH --partition=gpuH200x8
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=24:00:00
```

---
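The GPU isolation fix above can be sketched as a launcher script. This is a hypothetical sketch, not the actual job script: the `run_experiments.py` arguments and backgrounding/`wait` pattern are assumptions, though the `CUDA_VISIBLE_DEVICES` split and the `vllm serve` flags match the configuration described in these notes.

```shell
#!/bin/bash
# Hypothetical launcher sketch for the GPU isolation fix (arguments assumed).
# vLLM sees only GPUs 0,1; the adapter process sees only GPUs 2,3.
MODEL_8B="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/models/llama-3.1-8b-instruct"

# User-simulation server on GPUs 0,1, tensor-parallel across both, 90% memory
CUDA_VISIBLE_DEVICES=0,1 vllm serve "$MODEL_8B" \
    --port 8004 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &

# Adapter / experiment driver restricted to GPUs 2,3 (flags assumed)
CUDA_VISIBLE_DEVICES=2,3 python run_experiments.py --method contextual &

wait
```

Because `CUDA_VISIBLE_DEVICES` is set per process, neither process can even see the other's GPUs, so vLLM's up-front memory pre-allocation cannot starve the adapter.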
## Fixed: ContextualAdapter and ReflectionAdapter now use vLLM

**Previous issue**: These adapters used slow transformers inference (~20 sessions/hr).

**Solution**: Modified the adapters to use `VLLMClient` (HTTP API):

- `adapters/contextual_adapter.py` - uses the vLLM server on port 8003
- `adapters/reflection_adapter.py` - uses the vLLM server on port 8003
- Expected speedup: ~150x (from ~20 to ~3000+ sessions/hr)

**Still using transformers** (for all_memory, rag, rag_vector):

- PersonalizedLLMAdapter requires custom model code (embedding, reranker, extractor)
- These cannot easily be replaced with vLLM without major refactoring

---

## Commands Reference

### Check job status

```bash
squeue -u yurenh2
```

### Check job details

```bash
scontrol show job <job_id>
```

### View output logs

```bash
tail -f /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/exp_-.out
```

### Cancel job

```bash
scancel <job_id>
```

---

## Lessons Learned

1. **ALWAYS test on interactive nodes before submitting batch jobs**
2. **Understand GPU memory allocation** - vLLM pre-allocates memory up front
3. **Check which components actually use which servers** - not all adapters use vLLM
4. **Use CUDA_VISIBLE_DEVICES** to isolate GPU usage between processes
5. **Only the vanilla method uses VLLMAgentClient** - other adapters load their own models

---

Last updated: 2024-12-31
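For reference, the adapters' `VLLMClient` is not shown in these notes; below is a minimal sketch of what an HTTP client against a vLLM server might look like, using vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The class name, constructor defaults, and method names here are assumptions, not the actual implementation.

```python
# Hypothetical sketch of a VLLMClient-style HTTP client (names/defaults assumed).
# vLLM servers expose an OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request


class VLLMClient:
    def __init__(self, base_url="http://localhost:8003", model="llama-3.1-8b-instruct"):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def _payload(self, messages, temperature=0.7, max_tokens=512):
        # Build an OpenAI-style chat completion request body.
        return {
            "model": self.model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }

    def chat(self, messages, **kwargs):
        # POST the request and return the first completion's text.
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self._payload(messages, **kwargs)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
```

Because the client is just HTTP, the experiment driver needs no local GPU for the agent side, which is what allows the 50-profile parallelism noted above.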