path: root/docs/reward_section_revised.md
Age | Commit message | Author
2026-02-11 | Add revised reward modeling LaTeX section matching code implementation | YurenHao0426
Key changes from original:
- Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
- Single 7-label LLM classifier replaces abstract C_reward/C_gate
- Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
- Explicitly describes Llama-3.1-8B-Instruct as judge model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
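
The confidence gating named in the commit message could look roughly like the sketch below. It is an illustration only, not the repository's code: the function name `gate_by_confidence`, the input format (a mapping from label to probability), the placeholder label names, and the choice to treat the threshold as an inclusive hard cutoff are all assumptions. Only the value tau_c=0.6 comes from the commit message; the seven actual labels are not listed there, so none are reproduced here.

```python
from typing import Mapping, Optional, Tuple

TAU_C = 0.6  # confidence threshold tau_c, as stated in the commit message


def gate_by_confidence(
    label_probs: Mapping[str, float], tau_c: float = TAU_C
) -> Tuple[Optional[str], float]:
    """Return (label, confidence) for the most probable label.

    Hypothetical helper: the label is None when the classifier's confidence
    falls below tau_c, meaning the sample would be gated out. Treating the
    threshold as inclusive (>= passes) is an assumption.
    """
    if not label_probs:
        raise ValueError("label_probs must not be empty")
    # Pick the label with the highest probability.
    label, confidence = max(label_probs.items(), key=lambda kv: kv[1])
    if confidence < tau_c:
        return None, confidence
    return label, confidence


# Placeholder label names; the real 7-label set is not given in the commit.
confident = {"label_a": 0.75, "label_b": 0.25}
uncertain = {"label_a": 0.5, "label_b": 0.5}
print(gate_by_confidence(confident))
print(gate_by_confidence(uncertain))
```

Under these assumptions, the gate depends only on the judge's own confidence in its label, which matches the commit's note that gating is no longer based on memory attribution.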