Add preference format compliance paragraph to discussion

Discusses how structured JSON preferences are harder for the 8B agent to follow vs Reflection's natural language. Notes prompt template bias toward Reflection. Reports RAG+Rewrite improvement (+0.8pp success, -1.4pp timeout), closing ~50% of RAG-Reflection gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
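This is a minimal sketch of the rewrite step described in this commit. It makes several assumptions. Each retrieved card is assumed to be a dict with "condition" and "action" keys, as in the example in the diff, and the cards are assumed to arrive already sorted by retrieval score. The function names, the prompt wording, and the `generate` callable are all hypothetical. `generate` stands in for whatever call reaches the agent LLM; the actual module's prompt is not shown here. `template_rewrite` is an added no-LLM baseline and is not part of the described method.

```python
import json
from typing import Callable, Optional, Sequence

# Hypothetical card schema: {"condition": str, "action": str}
Card = dict


def build_rewrite_prompt(cards: Sequence[Card], k: int = 3) -> str:
    """Ask the agent LLM to merge the top-k structured cards into prose."""
    top = list(cards)[:k]  # assumes cards are sorted by retrieval score
    rules = "\n".join(json.dumps(card) for card in top)
    return (
        "Merge the following user preference rules into one or two fluent "
        "natural-language instructions for an assistant. "
        "Do not invent new preferences.\n" + rules
    )


def template_rewrite(cards: Sequence[Card], k: int = 3) -> str:
    """Deterministic fallback (no LLM call): one sentence per card."""
    return " ".join(
        f"When working on {card['condition']}, {card['action']}."
        for card in list(cards)[:k]
    )


def rewrite_preferences(
    cards: Sequence[Card],
    generate: Optional[Callable[[str], str]] = None,
    k: int = 3,
) -> str:
    """Return the text to inject into the agent prompt ("" if nothing retrieved)."""
    if not cards:
        return ""
    if generate is None:
        return template_rewrite(cards, k)
    # Hypothetical hook: `generate` sends the prompt to the agent LLM.
    return generate(build_rewrite_prompt(cards, k)).strip()
```

If no `generate` callable is passed, the sketch falls back to one templated sentence per card. For example, `rewrite_preferences([{"condition": "math problems", "action": "show step-by-step work"}])` returns "When working on math problems, show step-by-step work." If the agent LLM is passed as `generate`, the sketch returns the merged instructions, and those are what get injected into the prompt.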
author: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 03:03:53 +0000
committer: YurenHao0426 <blackhao0426@gmail.com> 2026-02-11 03:03:53 +0000
commit: 713b394a94207c8fa8a8f712bb548ee661017caf (patch)
tree: 9aaf5df4a47869df243215677f3051a4ad413a37 /docs
parent: 8e9248a361b0b3fb4f3dbd37a14b1028afeb3015 (diff)
1 files changed, 36 insertions, 0 deletions
diff --git a/docs/discussion_revised.md b/docs/discussion_revised.md
index 5d02dc6..5152156 100644
--- a/docs/discussion_revised.md
+++ b/docs/discussion_revised.md
@@ -81,6 +81,42 @@ GRPO-trained reflection agents from the original
 \textsc{MultiSessionCollab} paper remains future work and
 would likely raise the Reflection ceiling further.
 
+\paragraph{Preference format and agent compliance.}
+An important factor in RAG-based methods is the \emph{format}
+in which retrieved preferences are injected into the agent
+prompt.
+Our preference extractor stores preferences as structured
+condition--action rules (e.g., \texttt{"condition": "math
+problems", "action": "show step-by-step work"}).
+We find that directly injecting these structured rules into
+the prompt of an 8B-parameter agent often fails to elicit
+compliance, even when the correct preferences are retrieved:
+the agent appears to treat the structured format as metadata
+rather than actionable instructions.
+By contrast, Reflection's session-level summaries are
+expressed in natural language and are more readily followed
+by the agent.
+This representational mismatch is compounded by our use of
+the agent prompt template from \textsc{MultiSessionCollab},
+which was designed and tuned for Reflection-style notes
+rather than structured preference cards and therefore
+further disadvantages RAG-based methods.
+
+To address this, we introduce a lightweight \emph{rewrite}
+module that uses the agent LLM to merge the top-$k$ retrieved
+preference cards into one or two fluent natural-language
+instructions before prompt injection.
+In a variant experiment (RAG+Rewrite), this step improves
+task success by $+0.8$~pp over plain RAG ($51.6\% \to
+52.4\%$) and reduces the timeout rate by $1.4$~pp
+($25.8\% \to 24.4\%$), closing roughly half ($0.8$ of
+$1.7$~pp) of the success gap to Reflection ($53.3\%$).
+These results suggest that preference compliance is
+bottlenecked not only by retrieval quality but also by how
+preferences are \emph{presented} to the agent, and that
+bridging this format gap is a promising direction for
+further improvement.
+
 \paragraph{Why does Vanilla perform well?}
 A perhaps surprising finding is that Vanilla ($54.3\%$)
 nearly matches Reflection ($54.4\%$), despite having no