# MULTISESSIONCOLLAB Experiment Notes

## Experiment Configuration

### Scale

- **Profiles**: 200 users
- **Sessions per profile**: 30
- **Max turns per session**: 15
- **Total sessions**: 6,000 per method
- **Parallel profiles**: 10

### Datasets

- math-hard
- math-500
- bigcodebench

### Methods (7 total)

1. **vanilla** - Direct LLM without personalization (COMPLETED - 97.3% success)
2. **all_memory** - Full conversation history in context
3. **rag** - BM25-based retrieval
4. **rag_vector** - Vector embedding retrieval
5. **contextual** - Context-aware adaptation
6. **reflection** - Self-reflection mechanism
7. **reflection_grpo** - GRPO-trained reflection (requires SFT training completion)

---

## GPU Architecture (H200 x 4)

### Method-Specific Configuration

#### contextual / reflection (vLLM-based)

```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: vLLM server (port 8003) - agent inference - 90% memory
Both use tensor-parallel-size 2, LLaMA 3.1 8B
```

- ContextualAdapter and ReflectionAdapter now use VLLMClient (HTTP API)
- Parallelism: 50 profiles (HTTP requests, no local GPU needed)
- Expected throughput: ~3000+ sessions/hr

#### all_memory / rag / rag_vector (transformers-based)

```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: PersonalizedLLMAdapter's transformers models (via CUDA_VISIBLE_DEVICES=2,3)
  - embedding ~8B (Qwen3Embedding8B)
  - reranker ~8B (Qwen3Reranker)
  - chat ~1.5B (Qwen1.5B)
  - extractor ~0.6B (Qwen3_0.6B_SFT)
```

- These methods use PersonalizedLLM, which requires custom model code
- Parallelism: 10 profiles (limited by GPU memory for transformers)
- Expected throughput: ~200-500 sessions/hr (slower due to transformers)

#### vanilla (vLLM batch processing)

```
GPU 0,1: vLLM server (port 8003) - agent inference
GPU 2,3: vLLM server (port 8004) - user simulation
Both at 90% memory utilization
```

- Uses turn-synchronous batch processing
- Parallelism: 50 conversations batched together
- Expected throughput: ~3000+ sessions/hr

---

## Key Fixes Applied

### 1. FileNotFoundError: configs/local_models.yaml

**Affected**: all_memory, rag, rag_vector (PersonalizedLLMAdapter methods)

**Fix**: Created symlink

```bash
ln -sf /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/configs/local_models.yaml \
  /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/configs/local_models.yaml
```

### 2. RecursionError / Meta Device Error

**Affected**: contextual, reflection (transformers-based adapters)

**Cause**: vLLM was using all 4 GPUs at 90% memory utilization, leaving no memory for the adapter

**Fix**: Isolate GPUs

- vLLM on GPU 0,1 only (`CUDA_VISIBLE_DEVICES=0,1` for the vLLM server)
- Adapter on GPU 2,3 (`CUDA_VISIBLE_DEVICES=2,3` for run_experiments.py)

---

## Job IDs

### Current Experiments (H200 gpuH200x8 partition)

| Job ID | Method | Config | Status |
|--------|--------|--------|--------|
| 14897651 | contextual | vLLM (2 servers) | Pending |
| 14897652 | reflection | vLLM (2 servers) | Pending |
| 14897653 | all_memory | vLLM + transformers | Pending |
| 14897654 | rag | vLLM + transformers | Pending |
| 14897655 | rag_vector | vLLM + transformers | Pending |
| 14814526 | sft_train | Training | Pending |

### Completed

- vanilla: 97.3% task success (Job 14604065)

### Cancelled (old config with transformers-only adapters)

- 14896375-14896379

---

## File Paths

### Models

```
MODEL_8B="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/models/llama-3.1-8b-instruct"
```

### Profiles

```
PROFILE_PATH="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl"
```

### Output

```
--output-dir ../results/full_h200
```

---

## SLURM Settings

```bash
#SBATCH --account=bfqt-delta-gpu
#SBATCH --partition=gpuH200x8
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=24:00:00
```

---
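The GPU isolation fix above can be sketched as a launcher script. This is a hypothetical sketch, not the actual job script: the `run_experiments.py` arguments and backgrounding/`wait` pattern are assumptions, though the `CUDA_VISIBLE_DEVICES` split and the `vllm serve` flags match the configuration described in these notes.

```shell
#!/bin/bash
# Hypothetical launcher sketch for the GPU isolation fix (arguments assumed).
# vLLM sees only GPUs 0,1; the adapter process sees only GPUs 2,3.
MODEL_8B="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/models/llama-3.1-8b-instruct"

# User-simulation server on GPUs 0,1, tensor-parallel across both, 90% memory
CUDA_VISIBLE_DEVICES=0,1 vllm serve "$MODEL_8B" \
    --port 8004 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &

# Adapter / experiment driver restricted to GPUs 2,3 (flags assumed)
CUDA_VISIBLE_DEVICES=2,3 python run_experiments.py --method contextual &

wait
```

Because `CUDA_VISIBLE_DEVICES` is set per process, neither process can even see the other's GPUs, so vLLM's up-front memory pre-allocation cannot starve the adapter.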
## Fixed: ContextualAdapter and ReflectionAdapter now use vLLM

**Previous issue**: These adapters used slow transformers inference (~20 sessions/hr).

**Solution**: Modified the adapters to use `VLLMClient` (HTTP API):

- `adapters/contextual_adapter.py` - uses the vLLM server on port 8003
- `adapters/reflection_adapter.py` - uses the vLLM server on port 8003
- Expected speedup: ~150x (from ~20 to ~3000+ sessions/hr)

**Still using transformers** (for all_memory, rag, rag_vector):

- PersonalizedLLMAdapter requires custom model code (embedding, reranker, extractor)
- These cannot easily be replaced with vLLM without major refactoring

---

## Commands Reference

### Check job status

```bash
squeue -u yurenh2
```

### Check job details

```bash
scontrol show job <job_id>
```

### View output logs

```bash
tail -f /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/exp_-.out
```

### Cancel job

```bash
scancel <job_id>
```

---

## Lessons Learned

1. **ALWAYS test on interactive nodes before submitting batch jobs**
2. **Understand GPU memory allocation** - vLLM pre-allocates memory up front
3. **Check which components actually use which servers** - not all adapters use vLLM
4. **Use CUDA_VISIBLE_DEVICES** to isolate GPU usage between processes
5. **Only the vanilla method uses VLLMAgentClient** - other adapters load their own models

---

Last updated: 2024-12-31
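For reference, the adapters' `VLLMClient` is not shown in these notes; below is a minimal sketch of what an HTTP client against a vLLM server might look like, using vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The class name, constructor defaults, and method names here are assumptions, not the actual implementation.

```python
# Hypothetical sketch of a VLLMClient-style HTTP client (names/defaults assumed).
# vLLM servers expose an OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request


class VLLMClient:
    def __init__(self, base_url="http://localhost:8003", model="llama-3.1-8b-instruct"):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def _payload(self, messages, temperature=0.7, max_tokens=512):
        # Build an OpenAI-style chat completion request body.
        return {
            "model": self.model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }

    def chat(self, messages, **kwargs):
        # POST the request and return the first completion's text.
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self._payload(messages, **kwargs)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
```

Because the client is just HTTP, the experiment driver needs no local GPU for the agent side, which is what allows the 50-profile parallelism noted above.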