path: root/CLAUDE.md
author     YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 12:15:45 -0600
committer  YurenHao0426 <blackhao0426@gmail.com>  2026-01-27 12:15:45 -0600
commit     680513b7771a29f27cbbb3ffb009a69a913de6f9 (patch)
tree       a0d60aef9ade1b2953b915f535b990c0de95e493  /CLAUDE.md
parent     c06ec2f3b80f8968f09eb801b69237495b055ec1 (diff)
local reward model
Diffstat (limited to 'CLAUDE.md')
-rw-r--r--  CLAUDE.md  64
1 file changed, 60 insertions, 4 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
index 8819e73..b7b4ccd 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -104,6 +104,7 @@ HF_TOKEN=hf_... # For gated models
### 3. Models Required
- **Agent LLM**: `models/llama-3.1-8b-instruct/` (local) or vLLM server
- **User Simulator**: 70B model via vLLM or OpenAI API (gpt-5-mini)
+- **Reward Model**: `models/llama-3.1-8b-instruct/` for local reward (optional, same as agent)
- **Embeddings**: `models/qwen3-embedding-8b/` (for RAG methods)
- **Reranker**: `models/rerankers/` or Qwen3-8B (for RAG methods)
@@ -156,7 +157,8 @@ sbatch test_local_user.sh
| `--use-openai-user` | Use OpenAI API for user simulator |
| `--vllm-user-url` | Local vLLM user simulator URL |
| `--parallel-profiles` | Batch size for turn-synchronous processing |
-| `--reward-mode` | `keyword` (heuristic) or `llm` (GPT-5-nano judge) |
+| `--reward-mode` | `keyword`, `llm` (GPT-4o-mini), or `llm_local` (local vLLM) |
+| `--reward-vllm-url` | vLLM URL for local reward model (when `--reward-mode=llm_local`) |
---
@@ -179,6 +181,57 @@ Located in `collaborativeagents/results/`:
---
+## Reward Models
+
+The system supports three reward modes for RL updates:
+
+| Mode | Model | Pros | Cons |
+|------|-------|------|------|
+| `keyword` | Heuristic | Fast, no API/GPU | Less accurate |
+| `llm` | GPT-4o-mini (API) | High accuracy | API costs, latency |
+| `llm_local` | Llama-3.1-8B (vLLM) | High accuracy, no API costs, batch processing | Requires GPU |
+
+### Local Reward Model Setup
+
+The local reward model uses the same classification prompt as GPT-4o-mini but runs on a local vLLM server.
+
+**Test Results** (Llama-3.1-8B vs GPT-4o-mini):
+- Accuracy: 83-92% (both models)
+- Agreement: 100%
+- Throughput: 3.6 samples/sec (batched)
+
+**Files**:
+- `src/personalization/feedback/local_llm_reward.py` - `LocalLLMRewardClient` with batch support
+- `scripts/test_local_reward_batch.py` - Test script
+
+**Usage**:
+```bash
+# Start reward model vLLM server (GPU 3)
+python -m vllm.entrypoints.openai.api_server \
+ --model models/llama-3.1-8b-instruct \
+ --port 8005 \
+ --tensor-parallel-size 1 \
+ --dtype bfloat16 \
+ --max-model-len 4096
+
+# Run experiment with local reward
+python run_experiments.py \
+ --reward-mode llm_local \
+ --reward-vllm-url http://localhost:8005/v1 \
+ ...
+```
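Since the reward server speaks vLLM's OpenAI-compatible `/chat/completions` protocol, the request shape behind `--reward-vllm-url` can be sketched directly. This is illustrative only: the system prompt text and the `build_reward_request` helper are assumptions, not the actual `LocalLLMRewardClient` implementation in `src/personalization/feedback/local_llm_reward.py`.

```python
# Sketch of one feedback-classification request to the local reward model.
# The prompt wording and helper name are hypothetical; only the model path
# and port 8005 come from the setup above.
import json

REWARD_VLLM_URL = "http://localhost:8005/v1"  # matches --reward-vllm-url

def build_reward_request(user_reply: str) -> dict:
    """Build the JSON payload for classifying a single user reply."""
    return {
        "model": "models/llama-3.1-8b-instruct",
        "temperature": 0.0,  # deterministic labels
        "max_tokens": 8,     # a single label is enough
        "messages": [
            {"role": "system",
             "content": ("Classify the user's reply into one label: "
                         "neg_constraint_restate, neg_correction, neg_confusion, "
                         "pos_praise, pos_progress, neutral, topic_shift.")},
            {"role": "user", "content": user_reply},
        ],
    }

payload = build_reward_request("No, I already said I'm vegetarian.")
print(json.dumps(payload["messages"][1], ensure_ascii=False))
```

A batch of N feedback turns can be sent as N concurrent requests; vLLM batches them server-side, which is presumably what the batched-throughput figure above reflects.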
+
+**Classification Labels**:
+- `neg_constraint_restate`: User reasserts constraints (reward: -1.0)
+- `neg_correction`: User indicates content is wrong (reward: -0.8)
+- `neg_confusion`: User indicates confusion (reward: -0.6)
+- `pos_praise`: Explicit praise (reward: +0.8)
+- `pos_progress`: Constructive continuation (reward: +0.1)
+- `neutral`: Ambiguous feedback (reward: 0.0)
+- `topic_shift`: New topic (reward: 0.0, no update)
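
The label table above maps directly to a lookup; as a minimal sketch (the `LABEL_REWARDS` name and `reward_for` helper are illustrative, not taken from the codebase):

```python
# Label -> (reward, apply_update) mapping, mirroring the table above.
# Names are hypothetical; only topic_shift is documented as skipping the update.
LABEL_REWARDS = {
    "neg_constraint_restate": (-1.0, True),
    "neg_correction":         (-0.8, True),
    "neg_confusion":          (-0.6, True),
    "pos_praise":             (+0.8, True),
    "pos_progress":           (+0.1, True),
    "neutral":                (0.0, True),
    "topic_shift":            (0.0, False),  # no RL update on topic shifts
}

def reward_for(label: str) -> tuple[float, bool]:
    """Return (reward, whether to apply an RL update) for a classifier label."""
    # Unrecognized labels fall back to a no-op (an assumption, not documented).
    return LABEL_REWARDS.get(label, (0.0, False))

print(reward_for("neg_correction"))  # -> (-0.8, True)
```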
+
+---
+
## Experiments To Be Done
### 1. Full-Scale Benchmark (Priority: HIGH)
@@ -187,15 +240,17 @@ Located in `collaborativeagents/results/`:
**Setup**:
- User simulator: Local 70B (vLLM, TP=2) - NOT OpenAI (too slow)
- Agent: 8B LLaMA (vLLM)
-- Reward: LLM judge (GPT-5-nano via API)
+- Reward: Local 8B LLM judge (vLLM) - faster than API, same accuracy
**GPU Layout** (4x A100 80GB):
```
GPU 0-1: 70B user simulator (AWQ INT4, TP=2)
GPU 2: 8B agent
-GPU 3: Embedding + Reranker (for RAG methods)
+GPU 3: 8B reward model + Embedding + Reranker (for RAG methods)
```
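
One way to realize this layout is pinning each server to its GPUs with `CUDA_VISIBLE_DEVICES`. A launch-fragment sketch only: port 8005 for the reward model is the one value confirmed above; the other ports and the 70B AWQ model path are placeholders.

```bash
# Hypothetical launch fragment -- ports other than 8005 and the AWQ
# model path are placeholders, not taken from this document.
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
  --model models/llama-3.1-70b-instruct-awq --tensor-parallel-size 2 \
  --port 8001 &   # 70B user simulator

CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
  --model models/llama-3.1-8b-instruct --port 8002 &   # 8B agent

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server \
  --model models/llama-3.1-8b-instruct --port 8005 \
  --max-model-len 4096 &   # 8B reward model (shares GPU 3)
```

The embedding and reranker models for RAG methods would also be placed on GPU 3 alongside the reward server.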
+**Note**: The reward model (Llama-3.1-8B) shares GPU 3 with embedding/reranker models. Use `--reward-mode llm_local --reward-vllm-url http://localhost:8005/v1` to enable.
+
**Jobs**: Split by method and profile range (50 profiles each)
```
collaborativeagents/slurm/fullscale/
@@ -231,7 +286,8 @@ python run_experiments.py \
| `src/personalization/serving/personalized_llm.py` | Main inference class with `chat()`, `chat_prepare()`, `chat_complete()` |
| `src/personalization/models/llm/vllm_chat.py` | vLLM HTTP client with `build_messages()`, `answer()` |
| `src/personalization/retrieval/policy.py` | Memory retrieval with user vector |
-| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge |
+| `src/personalization/feedback/llm_reward.py` | GPT-based reward judge (API) |
+| `src/personalization/feedback/local_llm_reward.py` | Local vLLM-based reward judge (batch processing) |
### Experiment Framework
| File | Purpose |