path: root/collaborativeagents/training/grpo/generate_grpo_data.py
author: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 03:14:37 +0000
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 03:14:37 +0000
commit: 6a917d3eda85e5725c2d5ad3bf5ec9bd30262198
tree: 5c9408962f01036119ebe29cd34b45bf951865bd /collaborativeagents/training/grpo/generate_grpo_data.py
parent: 1956aed8bc8a72355adbe9f1d16ea678d67f214c
Rewrite reward section to describe keyword heuristic (matches experiments)

Replaced LLM-as-judge description with actual keyword-based system:
- Reward: sentiment keyword matching + topic coherence via embedding similarity
- Gating: separate retrieval-attribution heuristic using memory-query cosine similarity (g_t=0.9 retrieval fault, g_t=0.2 LLM fault, etc.)
- No additional model needed (fast, no GPU)
- REINFORCE update unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
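
A minimal sketch of how the reward and gating heuristics listed above might look. It is not the code from this commit. The keyword lists, the weighted-sum combination, the similarity threshold, and every function and parameter name are hypothetical. Only the gate values g_t=0.9 (retrieval fault) and g_t=0.2 (LLM fault) come from the commit message; the mapping of low memory-query similarity to a retrieval fault is an assumption, and the cases the message abbreviates as "etc." are deliberately left unspecified.

```python
import math
import re

# Hypothetical keyword lists; the commit does not name the actual keywords.
POSITIVE_KEYWORDS = {"thanks", "great", "perfect"}
NEGATIVE_KEYWORDS = {"wrong", "incorrect", "no"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keyword_reward(user_reply, reply_emb, prev_emb, coherence_weight=0.5):
    """Sentiment keyword match plus topic coherence.

    Assumption: the two signals are combined as a weighted sum; the commit
    message only says that both are used.
    """
    tokens = set(re.findall(r"[a-z']+", user_reply.lower()))
    sentiment = float(bool(tokens & POSITIVE_KEYWORDS)) - float(
        bool(tokens & NEGATIVE_KEYWORDS)
    )
    coherence = cosine(reply_emb, prev_emb)
    return sentiment + coherence_weight * coherence


def attribution_gate(reward, memory_emb, query_emb, sim_threshold=0.5):
    """Retrieval-attribution gate g_t based on memory-query cosine similarity.

    Assumption: on a negative reward, low similarity is blamed on retrieval
    (g_t=0.9) and high similarity on the LLM (g_t=0.2). Other cases are not
    given in the commit message, so this sketch returns None for them.
    """
    if reward >= 0:
        return None
    similarity = cosine(memory_emb, query_emb)
    return 0.9 if similarity < sim_threshold else 0.2
```

Both functions need only precomputed embedding vectors and string matching, which is consistent with the "no additional model needed" note, although the real system may compute its embeddings differently.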
Diffstat (limited to 'collaborativeagents/training/grpo/generate_grpo_data.py')
0 files changed, 0 insertions, 0 deletions