From dc801c07cf38b0c495686463e6ca6f871a64440e Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Tue, 27 Jan 2026 09:57:37 -0600
Subject: Add collaborativeagents module and update gitignore

- Add collaborativeagents subproject with adapters, agents, and evaluation modules
- Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results

Co-Authored-By: Claude Opus 4.5
---
 collaborativeagents/EXPERIMENT_NOTES.md | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 collaborativeagents/EXPERIMENT_NOTES.md

diff --git a/collaborativeagents/EXPERIMENT_NOTES.md b/collaborativeagents/EXPERIMENT_NOTES.md
new file mode 100644
index 0000000..5f7ea1c
--- /dev/null
+++ b/collaborativeagents/EXPERIMENT_NOTES.md
@@ -0,0 +1,192 @@

# MULTISESSIONCOLLAB Experiment Notes

## Experiment Configuration

### Scale
- **Profiles**: 200 users
- **Sessions per profile**: 30
- **Max turns per session**: 15
- **Total sessions**: 6,000 per method
- **Parallel profiles**: 10

### Datasets
- math-hard
- math-500
- bigcodebench

### Methods (7 total)
1. **vanilla** - Direct LLM without personalization (COMPLETED - 97.3% success)
2. **all_memory** - Full conversation history in context
3. **rag** - BM25-based retrieval
4. **rag_vector** - Vector embedding retrieval
5. **contextual** - Context-aware adaptation
6. **reflection** - Self-reflection mechanism
7. 
**reflection_grpo** - GRPO-trained reflection (requires SFT training completion)

---

## GPU Architecture (H200 x 4)

### Method-Specific Configuration

#### contextual / reflection (vLLM-based)
```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: vLLM server (port 8003) - agent inference - 90% memory
Both use tensor-parallel-size 2, LLaMA 3.1 8B
```
- ContextualAdapter and ReflectionAdapter now use VLLMClient (HTTP API)
- Parallelism: 50 profiles (HTTP requests, no local GPU needed)
- Expected throughput: ~3,000+ sessions/hr

#### all_memory / rag / rag_vector (transformers-based)
```
GPU 0,1: vLLM server (port 8004) - user simulation - 90% memory
GPU 2,3: PersonalizedLLMAdapter's transformers models (via CUDA_VISIBLE_DEVICES=2,3)
 - embedding ~8B (Qwen3Embedding8B)
 - reranker ~8B (Qwen3Reranker)
 - chat ~1.5B (Qwen1.5B)
 - extractor ~0.6B (Qwen3_0.6B_SFT)
```
- These methods use PersonalizedLLM, which requires custom model code
- Parallelism: 10 profiles (limited by GPU memory for transformers)
- Expected throughput: ~200-500 sessions/hr (slower due to transformers)

#### vanilla (vLLM batch processing)
```
GPU 0,1: vLLM server (port 8003) - agent inference
GPU 2,3: vLLM server (port 8004) - user simulation
Both at 90% memory utilization
```
- Uses turn-synchronous batching
- Parallelism: 50 conversations batched together
- Expected throughput: ~3,000+ sessions/hr

---

## Key Fixes Applied

### 1. FileNotFoundError: configs/local_models.yaml
**Affected**: all_memory, rag, rag_vector (PersonalizedLLMAdapter methods)

**Fix**: Created a symlink:
```bash
ln -sf /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/configs/local_models.yaml \
    /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/configs/local_models.yaml
```

### 2. 
RecursionError / Meta Device Error
**Affected**: contextual, reflection (while their adapters were still transformers-based)

**Cause**: vLLM was using all 4 GPUs at 90% memory, leaving nothing for the adapter models.

**Fix**: Isolate GPUs:
- vLLM on GPU 0,1 only (CUDA_VISIBLE_DEVICES=0,1 for the vLLM server)
- Adapter on GPU 2,3 (CUDA_VISIBLE_DEVICES=2,3 for run_experiments.py)

---

## Job IDs

### Current Experiments (H200 gpuH200x8 partition)
| Job ID | Method | Config | Status |
|--------|--------|--------|--------|
| 14897651 | contextual | vLLM (2 servers) | Pending |
| 14897652 | reflection | vLLM (2 servers) | Pending |
| 14897653 | all_memory | vLLM + transformers | Pending |
| 14897654 | rag | vLLM + transformers | Pending |
| 14897655 | rag_vector | vLLM + transformers | Pending |
| 14814526 | sft_train | Training | Pending |

### Completed
- vanilla: 97.3% task success (Job 14604065)

### Cancelled (old config with transformers-only adapters)
- 14896375-14896379

---

## File Paths

### Models
```
MODEL_8B="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/models/llama-3.1-8b-instruct"
```

### Profiles
```
PROFILE_PATH="/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/data/complex_profiles_v2/profiles_200.jsonl"
```

### Output
```
--output-dir ../results/full_h200
```

---

## SLURM Settings

```bash
#SBATCH --account=bfqt-delta-gpu
#SBATCH --partition=gpuH200x8
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=24:00:00
```

---

## Fixed: ContextualAdapter and ReflectionAdapter now use vLLM

**Previous issue**: These adapters used slow transformers inference (~20 sessions/hr).
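For reference, a vLLM server exposes an OpenAI-compatible HTTP endpoint, so an adapter only needs a small HTTP client instead of local model weights. A minimal sketch of such a client follows; the helper names, defaults, and base-URL handling are illustrative assumptions, not the repo's actual `VLLMClient`:

```python
# Minimal sketch of talking to a vLLM server's OpenAI-compatible
# chat endpoint (the notes run agent inference on port 8003).
# Helper names and defaults are illustrative, not from the repo.
import json
from urllib import request


def build_chat_payload(model, messages, temperature=0.7, max_tokens=512):
    """Assemble the JSON body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(base_url, payload):
    """POST the payload and return the first choice's message content."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the adapter side is now just HTTP, per-profile parallelism is bounded by request concurrency rather than local GPU memory, which is what allows 50 parallel profiles for these methods.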
**Solution**: Modified the adapters to use `VLLMClient` (HTTP API):
- `adapters/contextual_adapter.py` - uses the vLLM server on port 8003
- `adapters/reflection_adapter.py` - uses the vLLM server on port 8003
- Expected speedup: ~150x (from ~20 to ~3,000+ sessions/hr)

**Still using transformers** (for all_memory, rag, rag_vector):
- PersonalizedLLMAdapter requires custom model code (embedding, reranker, extractor)
- These cannot easily be moved to vLLM without major refactoring

---

## Commands Reference

### Check job status
```bash
squeue -u yurenh2
```

### Check job details
```bash
scontrol show job <job_id>
```

### View output logs
```bash
tail -f /projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/exp_-.out
```

### Cancel job
```bash
scancel <job_id>
```

---

## Lessons Learned

1. **ALWAYS test on interactive nodes before submitting batch jobs**
2. **Understand GPU memory allocation** - vLLM pre-allocates GPU memory up front
3. **Check which components actually use which servers** - not all adapters use vLLM
4. **Use CUDA_VISIBLE_DEVICES** to isolate GPU usage between processes
5. **Only the vanilla method uses VLLMAgentClient** - the other adapters load their own models

---

Last updated: 2024-12-31
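## Appendix: Turn-Synchronous Batching (sketch)

The turn-synchronous batching used by the vanilla method can be sketched as follows: all active conversations advance one turn together, so each turn issues a single batched request to the vLLM server instead of one request per conversation. All names, the stop condition, and `fake_agent_batch` (a stand-in for one batched vLLM call) are illustrative assumptions, not the repo's implementation:

```python
# Sketch of turn-synchronous batching: one batched agent call per turn
# across all still-active conversations. Illustrative only.
from dataclasses import dataclass, field


@dataclass
class Conversation:
    history: list = field(default_factory=list)
    done: bool = False


def fake_agent_batch(prompts):
    """Stand-in for one batched vLLM call; one reply per prompt."""
    return [f"reply:{p}" for p in prompts]


def run_turn_synchronous(conversations, max_turns):
    for turn in range(max_turns):
        active = [c for c in conversations if not c.done]
        if not active:
            break
        # Build this turn's prompts and send them as a single batch.
        prompts = [f"t{turn}" for _ in active]
        replies = fake_agent_batch(prompts)
        for conv, reply in zip(active, replies):
            conv.history.append(reply)
            if len(conv.history) >= 3:  # toy stop condition
                conv.done = True
    return conversations
```

Batching per turn keeps the vLLM server's continuous batching saturated, which is why the vanilla method sustains a much higher sessions/hr rate than the per-profile transformers methods.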