author     YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 12:15:45 -0600
committer  YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 12:15:45 -0600
commit     680513b7771a29f27cbbb3ffb009a69a913de6f9 (patch)
tree       a0d60aef9ade1b2953b915f535b990c0de95e493 /CLAUDE.md
parent     c06ec2f3b80f8968f09eb801b69237495b055ec1 (diff)
local reward model
Diffstat (limited to 'CLAUDE.md')
-rw-r--r--  CLAUDE.md  64
1 file changed, 60 insertions, 4 deletions
@@ -104,6 +104,7 @@ HF_TOKEN=hf_... # For gated models
 
 ### 3. Models Required
 - **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
 - **User Simulator**: 70B model via vLLM or OpenAI API (gpt-5-mini)
+- **Reward Model**: `models/llama-3.1-8b-instruct/` for local reward (optional, same as agent)
 - **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
 - **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)
@@ -156,7 +157,8 @@ sbatch test_local_user.sh
 | `--use-openai-user` | Use OpenAI API for user simulator |
 | `--vllm-user-url` | Local vLLM user simulator URL |
 | `--parallel-profiles` | Batch size for turn-synchronous processing |
-| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |
+| `--reward-mode` | `keyword`, `llm` (GPT-4o-mini), or `llm_local` (local vLLM) |
+| `--reward-vllm-url` | vLLM URL for local reward model (when `--reward-mode=llm_local`) |
 
 ---
 
@@ -179,6 +181,57 @@ Located in `collaborativeagents/results/`:
 
 ---
 
+## Reward Models
+
+The system supports three reward modes for RL updates:
+
+| Mode | Model | Pros | Cons |
+|------|-------|------|------|
+| `keyword` | Heuristic | Fast, no API/GPU | Less accurate |
+| `llm` | GPT-4o-mini (API) | High accuracy | API costs, latency |
+| `llm_local` | Llama-3.1-8B (vLLM) | High accuracy, no API costs, batch processing | Requires GPU |
+
+### Local Reward Model Setup
+
+The local reward model uses the same classification prompt as GPT-4o-mini but runs on a local vLLM server.
+
+**Test Results** (Llama-3.1-8B vs GPT-4o-mini):
+- Accuracy: 83-92% (both models)
+- Agreement: 100%
+- Throughput: 3.6 samples/sec (batched)
+
+**Files**:
+- `src/personalization/feedback/local_llm_reward.py` - `LocalLLMRewardClient` with batch support
+- `scripts/test_local_reward_batch.py` - Test script
+
+**Usage**:
+```bash
+# Start reward model vLLM server (GPU 3)
+python -m vllm.entrypoints.openai.api_server \
+  --model models/llama-3.1-8b-instruct \
+  --port 8005 \
+  --tensor-parallel-size 1 \
+  --dtype bfloat16 \
+  --max-model-len 4096
+
+# Run experiment with local reward
+python run_experiments.py \
+  --reward-mode llm_local \
+  --reward-vllm-url http://localhost:8005/v1 \
+  ...
+```
+
+**Classification Labels**:
+- `neg_constraint_restate`: User reasserts constraints (reward: -1.0)
+- `neg_correction`: User indicates content is wrong (reward: -0.8)
+- `neg_confusion`: User indicates confusion (reward: -0.6)
+- `pos_praise`: Explicit praise (reward: +0.8)
+- `pos_progress`: Constructive continuation (reward: +0.1)
+- `neutral`: Ambiguous feedback (reward: 0.0)
+- `topic_shift`: New topic (reward: 0.0, no update)
+
+---
+
 ## Experiments To Be Done
 
 ### 1. Full-Scale Benchmark (Priority: HIGH)
@@ -187,15 +240,17 @@ Located in `collaborativeagents/results/`:
 **Setup**:
 - User simulator: Local 70B (vLLM, TP=2) - NOT OpenAI (too slow)
 - Agent: 8B LLaMA (vLLM)
-- Reward: LLM judge (GPT-5-nano via API)
+- Reward: Local 8B LLM judge (vLLM) - faster than API, same accuracy
 
 **GPU Layout** (4x A100 80GB):
 ```
 GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
 GPU 2: 8B agent
-GPU 3: Embedding + Reranker (for RAG methods)
+GPU 3: 8B reward model + Embedding + Reranker (for RAG methods)
 ```
+
+**Note**: The reward model (Llama-3.1-8B) shares GPU 3 with embedding/reranker models. Use `--reward-mode llm_local --reward-vllm-url http://localhost:8005/v1` to enable.
 
 **Jobs**: Split by method and profile range (50 profiles each)
 ```
 collaborativeagents/slurm/fullscale/
@@ -231,7 +286,8 @@ python run_experiments.py \
 | `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
 | `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
 | `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
-| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |
+| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge (API) |
+| `src/personalization/feedback/local_llm_reward.py` | Local vLLM-based reward judge (batch processing) |
 
 ### Experiment Framework
 | File | Purpose |
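The diff references `LocalLLMRewardClient` in `src/personalization/feedback/local_llm_reward.py` but does not show its body. Below is a minimal sketch of how such a client could map the classification labels above to scalar rewards through the vLLM OpenAI-compatible endpoint started in the **Usage** block; the server URL and model path come from the diff, while the prompt wording, function name, and parsing are illustrative assumptions, not the repository's actual code.

```python
# Illustrative sketch only -- not the repo's actual local_llm_reward.py.
import requests

# Label -> reward mapping, copied from the "Classification Labels" list in the diff.
LABEL_REWARDS = {
    "neg_constraint_restate": -1.0,  # user reasserts constraints
    "neg_correction": -0.8,          # user indicates content is wrong
    "neg_confusion": -0.6,           # user indicates confusion
    "pos_praise": 0.8,               # explicit praise
    "pos_progress": 0.1,             # constructive continuation
    "neutral": 0.0,                  # ambiguous feedback
    "topic_shift": 0.0,              # new topic -> no RL update
}

# Hypothetical judge prompt; the actual classification prompt is not shown in the diff.
PROMPT = (
    "Classify the user's feedback on the assistant's last reply. "
    "Answer with exactly one label from: " + ", ".join(LABEL_REWARDS) + ".\n\n"
    "Feedback: {feedback}"
)

def classify_feedback(
    feedback: str,
    base_url: str = "http://localhost:8005/v1",   # matches --reward-vllm-url above
    model: str = "models/llama-3.1-8b-instruct",  # matches the vLLM --model flag above
) -> tuple[str, float]:
    """Ask the local vLLM server for a label and map it to a scalar reward."""
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": PROMPT.format(feedback=feedback)}],
            "temperature": 0.0,  # deterministic judging
            "max_tokens": 16,
        },
        timeout=60,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"].strip().lower()
    # Fall back to neutral (reward 0.0) if the model answers with anything unexpected.
    label = next((name for name in LABEL_REWARDS if name in text), "neutral")
    return label, LABEL_REWARDS[label]
```

Temperature 0 keeps the judge deterministic, and the `neutral` fallback means a malformed judge response maps to reward 0.0 rather than triggering a spurious RL update.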

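The diff also advertises batch support (3.6 samples/sec in the test results) but does not show how batching is done. Since vLLM coalesces concurrent requests server-side via continuous batching, one plausible client-side approach is a simple thread-pool fan-out; this is an assumption about the approach, not the repo's actual batching code, and it reuses `classify_feedback` from the sketch above.

```python
# Sketch only -- reuses classify_feedback from the previous example.
from concurrent.futures import ThreadPoolExecutor

def classify_batch(feedbacks: list[str], max_workers: int = 16) -> list[tuple[str, float]]:
    """Issue requests concurrently; vLLM's continuous batching coalesces them on the GPU."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_feedback, feedbacks))

if __name__ == "__main__":
    for label, reward in classify_batch([
        "No, I already said it has to be under 200 words.",  # expect a neg_* label
        "Great, that's exactly what I needed!",              # expect pos_praise
    ]):
        print(label, reward)
```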