path: root/docs/reward_section_revised.md
Age | Commit message | Author
2026-02-11 | Rewrite reward section to describe keyword heuristic (matches experiments) | YurenHao0426
Replaced LLM-as-judge description with actual keyword-based system:
- Reward: sentiment keyword matching + topic coherence via embedding similarity
- Gating: separate retrieval-attribution heuristic using memory-query cosine similarity (g_t=0.9 retrieval fault, g_t=0.2 LLM fault, etc.)
- No additional model needed (fast, no GPU)
- REINFORCE update unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 | Add revised reward modeling LaTeX section matching code implementation | YurenHao0426
Key changes from original:
- Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
- Single 7-label LLM classifier replaces abstract C_reward/C_gate
- Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
- Explicitly describes Llama-3.1-8B-Instruct as judge model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>