Key changes:
- Domain: math-only → 3 task domains (math-hard, math-500, bigcodebench)
- Scale: 5 profiles/40 pool → 60 profiles/200 pool, 60 sessions
- "correlates strongly" → significance-based claims (p=0.006, p=0.046)
- Contributions rewritten: efficiency gains + dual-vector separation
- Related work paragraphs unchanged (still accurate)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Updated all numbers to match current experiments:
- 55.2% success (not 71%), 60 profiles (not 5), 3 datasets
- Token reduction 16.9% (not "halving")
- Significance results (timeout p=0.046, effort p=0.021)
- Dual-vector separation (z_long p=0.006, z_short p=0.586)
- Updated future work (ablation underway, LLM judge ready)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Discusses how structured JSON preferences are harder for the 8B agent
to follow than Reflection's natural language. Notes the prompt-template
bias toward Reflection. Reports the RAG+Rewrite improvement (+0.8pp
success, -1.4pp timeout), closing ~50% of the RAG-Reflection gap.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Complete rewrite with current data (60 profiles × 60 sessions):
- Updated all numbers and removed stale references
- Removed duplicate paragraph
- Added: user vector role analysis (RAG 44.3% → RAG+Vec 26.4% timeout)
- Added: E/T decomposition (79% from enforcements, not negative)
- Added: why Vanilla performs well discussion
- Updated: user-vector geometry (ρ=0.040, dual-vector separation)
- Updated: limitations (keyword reward, no GRPO, 60 profiles)
- Updated: future directions (ablation underway, LLM judge ready)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Key changes:
- Fixed metadata (3 datasets, 60 profiles × 60 sessions)
- Removed false "three random seeds" claim
- Replaced all placeholder/TBD text with actual analysis
- Added significance tests table (paired t-test, p-values)
- Added E/T decomposition analysis
- Filled in user-vector representation analysis with actual data
(Spearman rho, quartile tests, dual-vector separation)
- Added bug cleaning disclosure (repetition bug)
- Refined failure modes section
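For reference, the kind of test added to the significance table can be sketched in a few lines of stdlib Python. The per-session numbers below are hypothetical placeholders (the actual per-session scores are in the experiment data, not this message):

```python
# Hand-rolled sketch of the significance tests referenced above: a paired
# t statistic across sessions and Spearman's rank correlation.
# All numeric arrays here are made-up illustrations, not real results.
import math
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic for paired samples (same sessions under two methods)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def spearman_rho(x, y):
    """Spearman rank correlation, no-ties form: 1 - 6*sum(d^2)/(n(n^2-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

baseline  = [0.50, 0.55, 0.48, 0.60, 0.52, 0.58]  # e.g. per-session success, method A
treatment = [0.56, 0.61, 0.50, 0.66, 0.57, 0.63]  # e.g. per-session success, method B
print("paired t:", round(paired_t(treatment, baseline), 3))
print("spearman:", spearman_rho([0.1, 0.3, 0.2, 0.5, 0.4, 0.6],
                                [0.52, 0.58, 0.55, 0.63, 0.60, 0.65]))
```

The p-values reported in the paper would then come from the t distribution with n-1 degrees of freedom (e.g. via scipy.stats.ttest_rel), which the paired statistic above feeds into.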
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Key corrections:
- 3 datasets (math-hard, math-500, bigcodebench), not math-hard only
- 60 profiles × 60 sessions, not 200 profiles × 60 turns
- User simulator: Llama-3.3-70B-Instruct (not 3.1)
- GPU layout: agent on GPU 2, embed/reranker on GPU 3
- Added reward model description
- Fixed incomplete sentence
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Key changes from original:
- Input: (q_t, a_t, q_{t+1}) only, removed A_t (not used in judge prompt)
- Single 7-label LLM classifier replaces abstract C_reward/C_gate
- Gating = classifier confidence (threshold tau_c=0.6), not memory attribution
- Explicitly describes Llama-3.1-8B-Instruct as judge model
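The gating rule above can be sketched as follows. The seven label names and the probabilities are hypothetical stand-ins; in the actual pipeline the distribution comes from the Llama-3.1-8B-Instruct judge, and only the tau_c=0.6 threshold is taken from this change:

```python
# Sketch of confidence-based gating: the 7-label classifier's output is
# accepted only when its top-label confidence clears tau_c = 0.6.
# Label names and scores below are illustrative, not the real label set.
TAU_C = 0.6

def gate(label_probs: dict, tau_c: float = TAU_C):
    """Return (label, confidence) if confident enough, else (None, confidence)."""
    label, conf = max(label_probs.items(), key=lambda kv: kv[1])
    return (label, conf) if conf >= tau_c else (None, conf)

# Hypothetical judge output over a 7-label scheme:
probs = {"satisfied": 0.71, "follow_up": 0.10, "clarification": 0.06,
         "repeat": 0.05, "complaint": 0.04, "topic_shift": 0.03, "other": 0.01}
print(gate(probs))                      # confident -> label kept
print(gate({k: 1 / 7 for k in probs})) # near-uniform -> gated out
```

Gating on the classifier's own confidence (rather than on memory attribution) keeps the rule local to one judge call, which is the simplification this commit describes.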
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- RAG rewrite adapter and vector preference pipeline in personalized_llm
- 60-session experiment queue scripts (reflection, rag, rag_vector, rag_rewrite)
- Vector-preference correlation analysis and visualization scripts
- Local reward model batch processing improvements
- Updated CLAUDE.md with full experiment documentation and notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>