author     YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 12:15:45 -0600
committer  YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 12:15:45 -0600
commit     680513b7771a29f27cbbb3ffb009a69a913de6f9 (patch)
tree       a0d60aef9ade1b2953b915f535b990c0de95e493 /CLAUDE.md
parent     c06ec2f3b80f8968f09eb801b69237495b055ec1 (diff)
local reward model
Diffstat (limited to 'CLAUDE.md')
-rw-r--r--  CLAUDE.md  64
1 file changed, 60 insertions, 4 deletions
@@ -104,6 +104,7 @@ HF_TOKEN=hf_... # For gated models
 
 ### 3. Models Required
 - **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
 - **User Simulator**: 70B model via vLLM or OpenAI API (gpt-5-mini)
+- **Reward Model**: `models/llama-3.1-8b-instruct/` for local reward (optional, same as agent)
 - **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
 - **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)
@@ -156,7 +157,8 @@ sbatch test_local_user.sh
 | `--use-openai-user` | Use OpenAI API for user simulator |
 | `--vllm-user-url` | Local vLLM user simulator URL |
 | `--parallel-profiles` | Batch size for turn-synchronous processing |
-| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |
+| `--reward-mode` | `keyword`, `llm` (GPT-4o-mini), or `llm_local` (local vLLM) |
+| `--reward-vllm-url` | vLLM URL for local reward model (when `--reward-mode=llm_local`) |
 
 ---
 
@@ -179,6 +181,57 @@ Located in `collaborativeagents/results/`:
 
 ---
 
+## Reward Models
+
+The system supports three reward modes for RL updates:
+
+| Mode | Model | Pros | Cons |
+|------|-------|------|------|
+| `keyword` | Heuristic | Fast, no API/GPU | Less accurate |
+| `llm` | GPT-4o-mini (API) | High accuracy | API costs, latency |
+| `llm_local` | Llama-3.1-8B (vLLM) | High accuracy, no API costs, batch processing | Requires GPU |
+
+### Local Reward Model Setup
+
+The local reward model uses the same classification prompt as GPT-4o-mini but runs on a local vLLM server.
+
+**Test Results** (Llama-3.1-8B vs GPT-4o-mini):
+- Accuracy: 83-92% (both models)
+- Agreement: 100%
+- Throughput: 3.6 samples/sec (batched)
+
+**Files**:
+- `src/personalization/feedback/local_llm_reward.py` - `LocalLLMRewardClient` with batch support
+- `scripts/test_local_reward_batch.py` - Test script
+
+**Usage**:
+```bash
+# Start reward model vLLM server (GPU 3)
+python -m vllm.entrypoints.openai.api_server \
+  --model models/llama-3.1-8b-instruct \
+  --port 8005 \
+  --tensor-parallel-size 1 \
+  --dtype bfloat16 \
+  --max-model-len 4096
+
+# Run experiment with local reward
+python run_experiments.py \
+  --reward-mode llm_local \
+  --reward-vllm-url http://localhost:8005/v1 \
+  ...
+```
+
+**Classification Labels**:
+- `neg_constraint_restate`: User reasserts constraints (reward: -1.0)
+- `neg_correction`: User indicates content is wrong (reward: -0.8)
+- `neg_confusion`: User indicates confusion (reward: -0.6)
+- `pos_praise`: Explicit praise (reward: +0.8)
+- `pos_progress`: Constructive continuation (reward: +0.1)
+- `neutral`: Ambiguous feedback (reward: 0.0)
+- `topic_shift`: New topic (reward: 0.0, no update)
+
+---
+
 ## Experiments To Be Done
 
 ### 1. Full-Scale Benchmark (Priority: HIGH)
@@ -187,15 +240,17 @@ Located in `collaborativeagents/results/`:
 **Setup**:
 - User simulator: Local 70B (vLLM, TP=2) - NOT OpenAI (too slow)
 - Agent: 8B LLaMA (vLLM)
-- Reward: LLM judge (GPT-5-nano via API)
+- Reward: Local 8B LLM judge (vLLM) - faster than API, same accuracy
 
 **GPU Layout** (4x A100 80GB):
 ```
 GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
 GPU 2: 8B agent
-GPU 3: Embedding + Reranker (for RAG methods)
+GPU 3: 8B reward model + Embedding + Reranker (for RAG methods)
 ```
+
+**Note**: The reward model (Llama-3.1-8B) shares GPU 3 with embedding/reranker models. Use `--reward-mode llm_local --reward-vllm-url http://localhost:8005/v1` to enable.
 
 **Jobs**: Split by method and profile range (50 profiles each)
 ```
 collaborativeagents/slurm/fullscale/
@@ -231,7 +286,8 @@ python run_experiments.py \
 | `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
 | `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
 | `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
-| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |
+| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge (API) |
+| `src/personalization/feedback/local_llm_reward.py` | Local vLLM-based reward judge (batch processing) |
 
 ### Experiment Framework
 | File | Purpose |
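The diff references `LocalLLMRewardClient` in `src/personalization/feedback/local_llm_reward.py` but does not show its body. Below is a minimal sketch of how such a client could map the classification labels above to scalar rewards through the vLLM OpenAI-compatible endpoint started in the **Usage** block; the server URL and model path come from the diff, while the prompt wording, function name, and parsing are illustrative assumptions, not the repository's actual code.

```python
# Illustrative sketch only -- not the repo's actual local_llm_reward.py.
import requests

# Label -> reward mapping, copied from the "Classification Labels" list in the diff.
LABEL_REWARDS = {
    "neg_constraint_restate": -1.0,  # user reasserts constraints
    "neg_correction": -0.8,          # user indicates content is wrong
    "neg_confusion": -0.6,           # user indicates confusion
    "pos_praise": 0.8,               # explicit praise
    "pos_progress": 0.1,             # constructive continuation
    "neutral": 0.0,                  # ambiguous feedback
    "topic_shift": 0.0,              # new topic -> no RL update
}

# Hypothetical judge prompt; the actual classification prompt is not shown in the diff.
PROMPT = (
    "Classify the user's feedback on the assistant's last reply. "
    "Answer with exactly one label from: " + ", ".join(LABEL_REWARDS) + ".\n\n"
    "Feedback: {feedback}"
)

def classify_feedback(
    feedback: str,
    base_url: str = "http://localhost:8005/v1",   # matches --reward-vllm-url above
    model: str = "models/llama-3.1-8b-instruct",  # matches the vLLM --model flag above
) -> tuple[str, float]:
    """Ask the local vLLM server for a label and map it to a scalar reward."""
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": PROMPT.format(feedback=feedback)}],
            "temperature": 0.0,  # deterministic judging
            "max_tokens": 16,
        },
        timeout=60,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"].strip().lower()
    # Fall back to neutral (reward 0.0) if the model answers with anything unexpected.
    label = next((name for name in LABEL_REWARDS if name in text), "neutral")
    return label, LABEL_REWARDS[label]
```

Temperature 0 keeps the judge deterministic, and the `neutral` fallback means a malformed judge response maps to reward 0.0 rather than triggering a spurious RL update.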

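The diff also advertises batch support (3.6 samples/sec in the test results) but does not show how batching is done. Since vLLM coalesces concurrent requests server-side via continuous batching, one plausible client-side approach is a simple thread-pool fan-out; this is an assumption about the approach, not the repo's actual batching code, and it reuses `classify_feedback` from the sketch above.

```python
# Sketch only -- reuses classify_feedback from the previous example.
from concurrent.futures import ThreadPoolExecutor

def classify_batch(feedbacks: list[str], max_workers: int = 16) -> list[tuple[str, float]]:
    """Issue requests concurrently; vLLM's continuous batching coalesces them on the GPU."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_feedback, feedbacks))

if __name__ == "__main__":
    for label, reward in classify_batch([
        "No, I already said it has to be under 200 words.",  # expect a neg_* label
        "Great, that's exactly what I needed!",              # expect pos_praise
    ]):
        print(label, reward)
```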