diff options
| author | Yuren Hao <yurenh2@illinois.edu> | 2026-04-08 22:06:05 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-04-08 22:06:05 -0500 |
| commit | 05704d0eb2fa59fe727652465b07db40bcb06c38 (patch) | |
| tree | 8904aca836cf552fd1a5ae8c2174e9f91e70bbbc /putnam-bench-anon | |
Initial release: GAP framework
- Full pipeline: variant generation, multi-judge verification, evaluation
- Loaders for OpenAI / Anthropic / Google / xAI / OpenRouter / vLLM
- Framework-level mechanism analyses: paired structural overlap, repairability rescue, self-correction probe, cross-model agreement, topic × problem-type interaction
- Unicode -> bare-LaTeX cleaner + audit + spot-check
- Mirrors https://huggingface.co/datasets/blackhao0426/PutnamGAP
Diffstat (limited to 'putnam-bench-anon')
29 files changed, 9465 insertions, 0 deletions
diff --git a/putnam-bench-anon/.gitignore b/putnam-bench-anon/.gitignore new file mode 100644 index 0000000..27c50b5 --- /dev/null +++ b/putnam-bench-anon/.gitignore @@ -0,0 +1,214 @@ +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[cod] +*$py.class + +# C extensions +*.so + +# Distribution / packaging +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +share/python-wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# PyInstaller +# Usually these files are written by a python script from a template +# before PyInstaller builds the exe, so as to inject date/other infos into it. +*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +*.py,cover +.hypothesis/ +.pytest_cache/ +cover/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 +db.sqlite3-journal + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +.pybuilder/ +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# IPython +profile_default/ +ipython_config.py + +# pyenv +# For a library or package, you might want to ignore these files since the code is +# intended to run in multiple environments; otherwise, check them in: +# .python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don't work, or not +# install all needed dependencies. +#Pipfile.lock + +# poetry +# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 
+# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control +#poetry.lock + +# pdm +# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. +#pdm.lock +# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it +# in version control. +# https://pdm.fming.dev/#use-with-ide +.pdm.toml + +# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm +__pypackages__/ + +# Celery stuff +celerybeat-schedule +celerybeat.pid + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ + +# pytype static type analyzer +.pytype/ + +# Cython debug symbols +cython_debug/ + +# PyCharm +# JetBrains specific template is maintained in a separate JetBrains.gitignore that can +# be added to the global gitignore or merged into this project gitignore. For a PyCharm +# project, it is not recommended to check the API keys and other secrets into version control. 
+.idea/ + +# VS Code +.vscode/ + +# HuggingFace cache and models +.cache/ +models/ +transformers_cache/ +huggingface_cache/ + +# CUDA cache +*.cu +*.cubin + +# Model checkpoints and outputs +checkpoints/ +outputs/ +results/ +logs/ + +# API keys and sensitive configuration +.env +*.key +api_keys.txt +secrets.json + +# Configuration files with sensitive data +*config*.json +.putnam_config.json +.putnam_env + +# Temporary files +*.tmp +*.temp +.DS_Store +Thumbs.db + +# Windows Zone Identifier files +*:Zone.Identifier + +# Data and experiment files +data/ +experiments/ +test_results/ + +# Large files +*.bin +*.safetensors +*.h5 +*.pt +*.pth + +# Virtual environments +putnam-local/ +putnam-env/
\ No newline at end of file diff --git a/putnam-bench-anon/README.md b/putnam-bench-anon/README.md new file mode 100644 index 0000000..8010895 --- /dev/null +++ b/putnam-bench-anon/README.md @@ -0,0 +1,630 @@ +# Putnam Mathematical Problem Solver + +A comprehensive system for evaluating AI models on mathematical problem solving using the Putnam Competition dataset. This project provides a unified interface for testing multiple AI providers (cloud and local) on complex mathematical problems with automated grading. + +## Features + +- **Multi-Provider Support**: 8 AI providers including OpenAI, Anthropic, Google Gemini, xAI, Kimi, VLLM, VLLM Direct, and HuggingFace +- **Dynamic Model Selection**: Runtime model configuration for optimal cost/performance +- **Robust Evaluation**: Specialized prompts for mathematical problem solving and grading +- **Local Model Support**: Run models locally via VLLM server or direct HuggingFace inference +- **GPU Optimization**: Tested on RTX 3060 with CUDA support +- **Cost Estimation**: Built-in cost calculation for different providers +- **Async Processing**: Efficient handling of large datasets +- **Error Recovery**: Intelligent retry logic and JSON parsing fallbacks +- **Unified CLI**: Single command-line interface for all operations +- **Progress Tracking**: Real-time progress bars with success/failure statistics +- **Multi-Variant Testing**: Test all 6 problem variants with a single command + +## Architecture + +The project follows a clean, modular architecture: + +``` +putnam-bench-anon/ +├── putnam_cli.py # 🎯 Main CLI interface (your primary entry point) +├── loader/ # 🔧 Core evaluation engine +│ ├── __init__.py +│ ├── providers/ # AI provider implementations +│ └── utils/ # Utility functions +├── scripts/ # 📋 Internal scripts (used by CLI) +│ ├── batch_evaluate.py # Batch evaluation logic +│ ├── health_check.py # Health checking utilities +│ ├── benchmark.py # Performance benchmarking +│ └── compare_*.py # Analysis scripts 
+├── dataset/ # 📚 Problem dataset +├── results/ # 📊 Evaluation results +└── requirements*.txt # 📦 Dependency management +``` + +**Key principle**: Use `putnam_cli.py` for all operations - it provides a clean, consistent interface to all functionality. + +## Installation + +### Quick Start (Automated) + +The easiest way to get started is using our automated installation script: + +```bash +# Clone the repository +git clone <repository-url> +cd putnam-bench-anon + +# One-command setup (installs everything and configures) +python install.py + +# Or quick install (core packages only) +python install.py --quick +``` + +### Manual Installation + +#### Environment Setup + +We recommend using a dedicated conda environment to avoid dependency conflicts: + +```bash +# Create dedicated environment +conda create -n putnam-local python=3.10 -y +conda activate putnam-local + +# Clone the repository +git clone <repository-url> +cd putnam-bench-anon +``` + +#### Choose Your Installation Type + +**Option 1: Full Installation (recommended)** +```bash +# Install all dependencies including local model support +pip install -r requirements.txt + +# Install PyTorch with CUDA support (for local models) +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 +``` + +**Quick tqdm install for progress bars:** +```bash +pip install tqdm +``` + +**Option 2: Minimal Installation (cloud providers only)** +```bash +# Install only core dependencies for cloud providers +pip install -r requirements-minimal.txt +``` + +**Option 3: Local Models Only** +```bash +# Install dependencies for local model inference +pip install -r requirements-local.txt + +# Install PyTorch with CUDA support +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 +``` + +**Option 4: Custom Installation** +```bash +# Install core packages manually +pip install openai anthropic google-generativeai transformers accelerate six + +# Optional: For local 
models +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 + +# Optional: For VLLM (see performance notes below) +pip install vllm +``` + +### GPU Requirements + +- **Recommended**: RTX 3060+ with 6GB+ VRAM +- **Minimum**: Any CUDA-compatible GPU with 4GB+ VRAM +- **CPU fallback**: Supported but not recommended for performance + +### Requirements Files Explained + +The project includes several requirements files for different use cases: + +- **`requirements.txt`**: Complete installation with all features (recommended) + - Includes cloud providers, local models, data analysis tools + - Best for full functionality and development + +- **`requirements-minimal.txt`**: Minimal installation for cloud providers only + - OpenAI, Anthropic, Google Gemini support + - Smallest footprint, fastest installation + - Best for cloud-only usage + +- **`requirements-local.txt`**: Specialized for local model inference + - HuggingFace transformers, VLLM, GPU utilities + - Includes model optimization tools + - Best for privacy-focused or offline usage + +Choose the requirements file that matches your intended usage pattern. 
+ +## Quick Start + +### Basic Usage + +```python +import asyncio +from loader import create_loader +import json + +async def main(): + # Create a loader for any provider + loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="gpt-4o-mini") + + # Load a problem + with open("dataset/2000-A-1.json") as f: + problem = json.load(f) + + # Test the problem + result = await loader.test_single_problem(problem, variant_type="original") + print(f"Grade: {result['final_grade']}") + print(f"Solution: {result['solution']}") + +asyncio.run(main()) +``` + +### Command Line Usage + +```bash +# Activate environment first +conda activate putnam-local + +# Use the unified CLI for all operations +python putnam_cli.py info # Show system info +python putnam_cli.py health --provider openai +python putnam_cli.py test --provider openai +python putnam_cli.py solve dataset/2000-A-1.json --provider openai +python putnam_cli.py batch dataset/ --provider openai --max-files 10 +python putnam_cli.py multi-test --provider openai --max-files 50 --concurrent 100 +``` + +## Multi-Provider Support + +The system supports **8 AI providers** with a unified interface: + +### Cloud Providers + +#### 1. **OpenAI** +- **Models**: GPT-4o, GPT-4o-mini, o1, o3 +- **Best for**: High-quality solutions and grading +- **Setup**: Requires `OPENAI_API_KEY` environment variable + +#### 2. **Anthropic** +- **Models**: Claude-3.5-Sonnet, Claude-3.5-Haiku +- **Best for**: Detailed reasoning and explanation +- **Setup**: Requires `ANTHROPIC_API_KEY` environment variable + +#### 3. **Google Gemini** +- **Models**: Gemini-1.5-Pro, Gemini-1.5-Flash +- **Best for**: Cost-effective high-performance solving +- **Setup**: Requires `GOOGLE_API_KEY` environment variable + +#### 4. **xAI** +- **Models**: Grok-3, Grok-2 +- **Best for**: Advanced reasoning and mathematical problem solving +- **Setup**: Requires `XAI_API_KEY` environment variable + +#### 5. 
**Kimi (Moonshot AI)** +- **Models**: moonshot-v1-8k, moonshot-v1-32k, moonshot-v1-128k +- **Best for**: Chinese and English mathematical problem solving +- **Setup**: Requires `MOONSHOT_API_KEY` environment variable + +### Local Providers + +#### 6. **HuggingFace** +- **Models**: Any HuggingFace model (tested: GPT-2, DialoGPT) +- **Performance**: Fast loading (~40s first time, then cached) +- **Cost**: Free after setup +- **Privacy**: Complete local inference +- **Best for**: Development, testing, cost-sensitive applications + +#### 7. **VLLM Server** +- **Models**: Any VLLM-compatible model +- **Best for**: Multi-user server deployment +- **Setup**: Requires running separate VLLM server +- **Performance**: Good for sustained workloads + +#### 8. **VLLM Direct** ⚠️ +- **Models**: Any VLLM-compatible model +- **Performance**: **NOT RECOMMENDED** - 72+ second initialization +- **Issue**: Extremely slow first load due to graph compilation +- **Use case**: Only for research/benchmarking VLLM internals + +```python +# NOT RECOMMENDED for interactive use +loader = create_loader("vllm_direct", solver_model="gpt2") # Takes 72+ seconds! 
+``` + +## Performance Test Results + +Based on testing with RTX 3060 Laptop GPU (6GB VRAM): + +| Provider | First Load | Subsequent | GPU Memory | Recommendation | +|----------|------------|-----------|------------|----------------| +| API Clients | 1-2s | 1-2s | 0GB (cloud) | ⭐ **Best** | +| HuggingFace | ~40s | <1s | 0.27GB | ⭐ **Excellent** | +| VLLM Server | ~30s | <1s | Variable | ✅ Good | +| VLLM Direct | **72+s** | <1s | 0.24GB | ❌ **Avoid** | + +## Configuration Examples + +### HuggingFace Local Setup + +```python +# GPU inference (recommended) +loader = create_loader( + "huggingface", + solver_model="gpt2", + grader_model="gpt2", + device="cuda" +) + +# CPU inference (slower) +loader = create_loader( + "huggingface", + solver_model="gpt2", + grader_model="gpt2", + device="cpu" +) +``` + +### OpenAI Configuration + +```python +# High-quality setup +loader = create_loader( + "openai", + solver_model="gpt-4o-mini", + grader_model="o3" +) +``` + +### Kimi Configuration + +```python +# Standard 8k context window +loader = create_loader( + "kimi", + solver_model="moonshot-v1-8k", + grader_model="moonshot-v1-8k" +) + +# Large context window for complex problems +loader = create_loader( + "kimi", + solver_model="moonshot-v1-128k", + grader_model="moonshot-v1-128k" +) +``` + +### VLLM Server Setup + +```bash +# Start VLLM server first (separate terminal): +conda activate putnam-local +vllm serve meta-llama/Llama-3.2-8B-Instruct --port 8000 +``` + +```python +# Then use the client +loader = create_loader( + "vllm", + solver_model="meta-llama/Llama-3.2-8B-Instruct", + base_url="http://localhost:8000/v1" +) +``` + +## Recommended Usage Patterns + +### For Development/Testing +```python +# Fast iteration with local models +loader = create_loader("huggingface", solver_model="gpt2", device="cuda") +``` + +### For Production/Important Results +```python +# High-quality cloud models +loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="o3") +loader = 
create_loader("anthropic", solver_model="claude-3-5-sonnet") +``` + +### For Privacy-Sensitive Work +```python +# Completely local inference +loader = create_loader("huggingface", solver_model="gpt2", device="cuda") +``` + +### For Cost Optimization +```python +# Free local models after setup +loader = create_loader("huggingface", solver_model="gpt2", device="cuda") + +# Or cost-effective cloud +loader = create_loader("openai", solver_model="gpt-4o-mini") +``` + +## Dataset Structure + +The dataset contains Putnam Competition problems in JSON format: + +```json +{ + "original": { + "problem_statement": "Mathematical problem...", + "solution": "Step-by-step solution...", + "problem_type": "proof" + }, + "descriptive_long": {...}, + "kernel_variant": {...} +} +``` + +### Problem Variants + +- **original**: The standard problem statement +- **descriptive_long**: More verbose problem description +- **descriptive_long_confusing**: Intentionally confusing wording +- **descriptive_long_misleading**: Misleading formulation +- **garbled_string**: Corrupted text version +- **kernel_variant**: Minimal essential formulation + +## Advanced Usage + +### GPU Memory Management + +```python +# Check GPU status +import torch +if torch.cuda.is_available(): + print(f"GPU: {torch.cuda.get_device_name(0)}") + print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB") +``` + +### Cost Estimation + +```python +# Estimate costs before processing +cost_info = await loader.estimate_cost( + num_problems=100, + avg_problem_length=1000, + avg_solution_length=2000 +) + +print(f"Estimated cost: ${cost_info['total_cost']:.2f}") +print(f"Cost per problem: ${cost_info['cost_per_problem']:.4f}") +``` + +### Batch Processing + +```python +async def process_dataset(dataset_path, provider="openai"): + loader = create_loader(provider) + + problems = [] + for file_path in Path(dataset_path).glob("*.json"): + with open(file_path) as f: + problems.append(json.load(f)) + + # 
Process problems concurrently + tasks = [ + loader.test_single_problem(problem, variant_type="original") + for problem in problems + ] + + results = await asyncio.gather(*tasks, return_exceptions=True) + return results +``` + +#### Incremental Saving and Resume Support + +The batch evaluation now supports incremental saving and resume functionality: + +```bash +# Start a new evaluation with incremental saving +python putnam_cli.py batch dataset/ --provider openai --output results/my_results.json + +# If interrupted, resume from checkpoint (no other arguments needed!) +python putnam_cli.py batch --resume results/checkpoint_my_results_20240101_120000.json + +# The checkpoint file is automatically created and updated after each problem +# It contains all progress and can be used to resume if the process is interrupted +``` + +**Features:** +- **Automatic Checkpointing**: Results are saved after each problem completes +- **Resume Support**: Continue from where you left off if interrupted +- **Backward Compatibility**: Existing checkpoint files continue to work +- **Atomic Saves**: Uses temporary files to ensure data integrity +- **Progress Tracking**: Shows saved/completed problems in real-time +- **Automatic Cleanup**: Checkpoint files are removed after successful completion + +**Resume Usage:** +```bash +# For new checkpoint files (created after this update) +python putnam_cli.py batch --resume checkpoint_file.json + +# For existing checkpoint files (created before this update) +python putnam_cli.py batch dataset/ --provider openai --resume old_checkpoint_file.json +``` + +The system automatically detects checkpoint format and handles both old and new formats seamlessly. 
+ +### Multi-Variant Testing + +Test all 6 problem variants with a single command using the unified CLI: + +```bash +# Test all variants with OpenAI (50 files per variant) +python putnam_cli.py multi-test --provider openai --max-files 50 + +# Test specific variants only +python putnam_cli.py multi-test --provider anthropic --variants original kernel_variant --max-files 25 + +# Test with custom models and high concurrency +python putnam_cli.py multi-test --provider openai --solver-model gpt-4.1-nano --grader-model o3 --max-files 1051 --concurrent 1100 + +# Test with local models +python putnam_cli.py multi-test --provider huggingface --device cuda --max-files 100 + +# Test with VLLM server +python putnam_cli.py multi-test --provider vllm --vllm-url http://localhost:8000/v1 --max-files 50 +``` + +**Available Problem Variants:** +- `original` - Standard mathematical problems +- `descriptive_long` - Clear variable renaming +- `descriptive_long_confusing` - Random unrelated words (marshmallow, armadillo, etc.) +- `descriptive_long_misleading` - Misleading variable names (nonpositiveterm, negativeinitial, etc.) +- `garbled_string` - Completely scrambled variable names +- `kernel_variant` - Simplified core mathematical version + +**Output Structure:** +``` +multi_variant_results/ +├── openai_gpt-4o-mini_20241201_143022/ +│ ├── original_20241201_143022.json +│ ├── descriptive_long_20241201_143022.json +│ ├── ... 
+│ └── SUMMARY_openai_gpt-4o-mini_20241201_143022.json +└── multi_config_comparison_20241201_143022.json +``` + +## Troubleshooting + +### CUDA Issues +```bash +# Check CUDA compatibility +python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" + +# If CUDA issues, reinstall PyTorch +pip uninstall torch torchvision torchaudio -y +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 +``` + +### Memory Issues +```python +# Clear GPU cache +import torch +if torch.cuda.is_available(): + torch.cuda.empty_cache() +``` + +### VLLM Performance Issues +- **Problem**: VLLM Direct takes 72+ seconds to load +- **Solution**: Use HuggingFace local or VLLM server instead +- **Alternative**: Use cloud providers for better speed + +## Testing + +```bash +# Activate environment +conda activate putnam-local + +# Quick health checks +python -c " +import asyncio +from loader import create_loader + +async def test(): + # Test available providers + providers = [ + ('openai', 'gpt-4o-mini'), + ('huggingface', 'gpt2'), + ('anthropic', 'claude-3-5-haiku'), + ('kimi', 'moonshot-v1-8k') + ] + for provider, model in providers: + try: + loader = create_loader(provider, solver_model=model) + result = await loader.health_check() + print(f'{provider}: {"✅ Ready" if result else "❌ Failed"}') + except Exception as e: + print(f'{provider}: ❌ Error - {e}') + +asyncio.run(test()) +" +``` + +## Performance Recommendations + +### For Speed ⚡ +1. **API Clients**: Sub-2s response (OpenAI, Anthropic, xAI, Gemini) +2. **HuggingFace**: Fast after first load, completely local +3. **VLLM Server**: Good for sustained workloads + +### For Cost 💰 +1. **HuggingFace**: Free after setup, local inference +2. **OpenAI GPT-4o-mini**: Cost-effective cloud option +3. **Gemini Flash**: Good price/performance ratio + +### For Privacy 🔒 +1. **HuggingFace**: Complete local inference +2. 
**VLLM Server**: Local deployment with server architecture + +### For Quality 🎯 +1. **OpenAI o3**: Excellent grading capability +2. **Claude-3.5-Sonnet**: Detailed explanations +3. **Gemini Pro**: Strong mathematical reasoning + +## API Reference + +### Supported Providers +- `openai`: OpenAI GPT models +- `anthropic`: Anthropic Claude models +- `gemini`: Google Gemini models +- `xai`: xAI Grok models +- `kimi`: Kimi/Moonshot models +- `vllm`: VLLM server models +- `vllm_direct`: Direct VLLM API (slow) +- `huggingface`: HuggingFace transformers + +### Factory Function + +```python +create_loader(provider: str, **kwargs) -> ModelLoader +``` + +### Key Methods + +```python +# Test a single problem +await loader.test_single_problem(problem_data, variant_type) + +# Health check +await loader.health_check() + +# Cost estimation +await loader.estimate_cost(num_problems) + +# Model information +loader.get_model_info() +``` + +## Contributing + +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Test with the putnam-local environment +4. Add tests for new functionality +5. Submit a pull request + +## License + +[License information] + +## Citation + +If you use this code in your research, please cite: + +```bibtex +@misc{putnam-bench-anon, + title={Putnam Mathematical Problem Solver}, + year={2024}, + howpublished={\url{<repository-url>}} +} +```
\ No newline at end of file diff --git a/putnam-bench-anon/calibrate_to_o3.py b/putnam-bench-anon/calibrate_to_o3.py new file mode 100644 index 0000000..d2373cc --- /dev/null +++ b/putnam-bench-anon/calibrate_to_o3.py @@ -0,0 +1,335 @@ +#!/usr/bin/env python3 +""" +calibrate_to_o3.py – End-to-end pipeline that +1. ingests existing o4-mini grading results for multiple models, +2. draws a budget-constrained stratified sample, +3. (optionally) re-grades those samples with o3 to obtain gold labels, +4. learns per-stratum error rates and calibrates all o4 labels to the o3 scale, +5. outputs required artefacts: + – sample_list.csv + – o3_raw.parquet (only when --run-o3) + – calibrated_o3_scores.csv + +Run: + python calibrate_to_o3.py # stop after sampling only + python calibrate_to_o3.py --run-o3 # also call o3 re-grader + +""" + +from __future__ import annotations +import argparse +import asyncio +import json +import logging +import math +import random +from pathlib import Path +from typing import Dict, List, Tuple, Any + +import numpy as np +import pandas as pd +from scipy.stats import norm + +# Third-party library used by --run-o3 mode +try: + from loader.openai_client import OpenAIModelLoader # type: ignore +except ModuleNotFoundError: + OpenAIModelLoader = None # graceful degradation when running sampling-only mode + +############################################################################### +# Constants – adjust here if the budget or cost model ever changes +############################################################################### +COST_PER_RECORD = 0.154 # USD per o3 grading request +BUDGET_MAX = 800.0 # USD hard cap +N_MAX = math.floor(BUDGET_MAX / COST_PER_RECORD) # 5194 with default params +SEED = 42 +MIN_PER_LAYER = 10 + +############################# Logging setup ################################### +logging.basicConfig( + level=logging.INFO, + format="%(asctime)s │ %(levelname)-8s │ %(message)s", + datefmt="%H:%M:%S", +) +LOGGER = 
logging.getLogger("calibrate")

###############################################################################
# Utility functions
###############################################################################

def wilson_ci(k: float, n: int, conf: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a proportion.

    Parameters
    ----------
    k : float
        Number of successes; may be fractional (calibrated successes).
    n : int
        Number of trials.
    conf : float
        Two-sided confidence level (default 0.95).

    Returns
    -------
    (low, high) clipped to [0, 1]; (0.0, 0.0) when n == 0.
    """
    if n == 0:
        return 0.0, 0.0
    z = norm.ppf(1 - (1 - conf) / 2)
    p_hat = k / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z
        * math.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * n)) / n)
        / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def parse_diff(index: str) -> int:
    """Extract the trailing difficulty digit (1-6) from an index like 2024-B-6.

    Returns -1 for anything that is not a valid difficulty. BUG FIX: the
    previous version returned any trailing integer (e.g. "2024" -> 2024),
    letting malformed indices leak through as bogus strata; we now enforce
    the documented 1-6 range.
    """
    try:
        diff = int(index.split("-")[-1])
    except (ValueError, IndexError):
        return -1  # fallback – will be filtered out later
    return diff if 1 <= diff <= 6 else -1

###############################################################################
# 1  Load meta-data from dataset/*.json – mapping from problem index to (type, diff)
###############################################################################

def load_dataset_metadata(dataset_dir: Path) -> Dict[str, Tuple[str, int]]:
    """Map problem index -> (type, difficulty) from every dataset/*.json file.

    Files that fail to parse, or records missing index/type/difficulty, are
    skipped with a warning rather than aborting the whole pipeline.
    """
    mapping: Dict[str, Tuple[str, int]] = {}
    json_files = sorted(dataset_dir.glob("*.json"))
    for fp in json_files:
        try:
            with fp.open("r", encoding="utf-8") as f:
                data = json.load(f)
            idx = data.get("index")
            typ = data.get("type")
            # parse_diff raises AttributeError on idx=None; that is caught
            # below like any other malformed file.
            diff = parse_diff(idx)
            if idx and typ and diff != -1:
                mapping[idx] = (typ, diff)
        except Exception as e:
            LOGGER.warning(f"Failed to parse {fp}: {e}")
    LOGGER.info(f"Loaded metadata for {len(mapping):,} problems from dataset")
    return mapping

###############################################################################
# 2  Load all o4-mini result JSONs into one DataFrame
+############################################################################### + +def load_o4_results(results_root: Path, meta: Dict[str, Tuple[str, int]]) -> pd.DataFrame: + rows: List[Dict[str, Any]] = [] + model_dirs = [d for d in results_root.iterdir() if d.is_dir()] + for model_dir in model_dirs: + model_id = model_dir.name + # consider only *_original.json for uniformity + for fp in model_dir.glob("*original.json"): + try: + with fp.open("r", encoding="utf-8") as f: + res = json.load(f) + for pr in res.get("problems", []): + idx = pr.get("index") + grade_info = pr.get("grade", {}) + o4_score = int(grade_info.get("grade") == "CORRECT") + # meta info + typ, diff = meta.get(idx, (None, None)) + if typ is None: + continue # skip problems without meta + row = { + "id": idx, + "model_id": model_id, + "type": typ, + "diff": diff, + "o4_score": o4_score, + # Extra fields useful for optional o3 grading + "student_solution": pr.get("solve", {}).get("solution", ""), + } + rows.append(row) + except Exception as e: + LOGGER.warning(f"Failed to process {fp}: {e}") + df = pd.DataFrame(rows) + LOGGER.info(f"Ingested {len(df):,} problem-model pairs across {df['model_id'].nunique()} models") + return df + +############################################################################### +# 3 Stratified sampling under budget +############################################################################### + +def stratified_sample(df: pd.DataFrame) -> pd.DataFrame: + rng = np.random.default_rng(SEED) + group_cols = ["type", "diff", "o4_score"] + + # Compute desired sample sizes per layer + layer_counts = df.groupby(group_cols, observed=True).size().rename("N_k") + total_records = len(df) + target_sizes = ( + (layer_counts / total_records * N_MAX).apply(np.ceil).astype(int).clip(lower=MIN_PER_LAYER) + ) + + # If the initial allocation exceeds budget, scale down proportionally (but keep >=MIN_PER_LAYER) + total_target = target_sizes.sum() + if total_target > N_MAX: + LOGGER.info( + 
f"Initial allocation {total_target} exceeds N_MAX={N_MAX}. Scaling down proportionally." + ) + scaling = (N_MAX - MIN_PER_LAYER * target_sizes.size) / ( + total_target - MIN_PER_LAYER * target_sizes.size + ) + scaling = max(scaling, 0.0) + target_sizes = ( + MIN_PER_LAYER + + np.floor((target_sizes - MIN_PER_LAYER) * scaling).astype(int) + ) + LOGGER.info( + f"Final per-stratum sample sizes prepared (sum={target_sizes.sum()}) – within budget" + ) + + # Actual sampling + samples = [] + for key, group in df.groupby(group_cols, observed=True): + n = min(target_sizes.get(key, MIN_PER_LAYER), len(group)) + if n <= 0: + continue + sample_idx = rng.choice(group.index.to_numpy(), size=n, replace=False) + samples.append(df.loc[sample_idx]) + sample_df = pd.concat(samples, ignore_index=True) + LOGGER.info(f"Sampled {len(sample_df):,} rows in total (<= {N_MAX})") + return sample_df + +############################################################################### +# 4 Async o3 re-grading helper +############################################################################### + +async def grade_with_o3(sample_df: pd.DataFrame, meta: Dict[str, Tuple[str, int]]) -> pd.Series: + """Returns pd.Series of int o3_score aligned with sample_df.index.""" + if OpenAIModelLoader is None: + raise RuntimeError("OpenAIModelLoader not available. 
Install dependencies or run without --run-o3.") + + async with OpenAIModelLoader(solver_model="o3", grader_model="o3") as loader: + + async def grade_one(row) -> int: + idx = row.id + question = None + reference_solution = None + # load dataset file lazily when needed + dataset_file = Path("dataset") / f"{idx}.json" + if dataset_file.exists(): + try: + with dataset_file.open("r", encoding="utf-8") as f: + data = json.load(f) + question = data.get("question", "") + reference_solution = data.get("solution", "") + except Exception: + pass + if not question: + return -1 # cannot grade + student_solution = row.student_solution or "" + try: + grade_result, _ = await loader.grade_solution( + question, + student_solution, + reference_solution, + problem_type="proof", + model="o3", + ) + return int(grade_result.get("grade") == "CORRECT") if grade_result else -1 + except Exception as exc: + LOGGER.warning(f"o3 grading failed for {idx}: {exc}") + return -1 + + sem = asyncio.Semaphore(20) + async def sem_grade(row): + async with sem: + return await grade_one(row) + + tasks = [asyncio.create_task(sem_grade(row)) for _, row in sample_df.iterrows()] + o3_scores = await asyncio.gather(*tasks) + return pd.Series(o3_scores, index=sample_df.index, name="o3_score") + +############################################################################### +# 5 Calibration – compute per-stratum error rates and apply +############################################################################### + +def compute_error_rates(sample_df: pd.DataFrame) -> pd.DataFrame: + group_cols = ["type", "diff"] + + # Build contingency counts per stratum + counts = sample_df.groupby(group_cols + ["o4_score", "o3_score"], observed=True).size().unstack(fill_value=0) + # Ensure o3_score columns 0 and 1 exist + for col in [0, 1]: + if col not in counts.columns: + counts[col] = 0 + # counts index columns: type, diff, o4_score + # Compute p1_k and p0_k + records = [] + for (typ, diff, o4_val), row in 
def apply_calibration(full_df: pd.DataFrame, err_df: pd.DataFrame) -> pd.Series:
    """Map raw o4-mini scores onto the estimated o3 scale.

    For each row the estimated probability that o3 would mark it correct is
    ``1 - p1`` when o4 said correct (p1 = o4's false-positive rate in that
    stratum) and ``p0`` when o4 said incorrect. Strata missing from
    ``err_df`` fall back to a 0.10 error rate.

    Args:
        full_df: All results; must have ``type``, ``diff`` and ``o4_score`` columns.
        err_df: Per-stratum error rates with columns ``type``, ``diff``, ``p1``, ``p0``.

    Returns:
        pd.Series named ``o3_est`` aligned with ``full_df.index``.
    """
    merged = full_df.merge(err_df, on=["type", "diff"], how="left")
    # Assign the filled column back: chained `.fillna(..., inplace=True)` on a
    # column selection is deprecated and a silent no-op under pandas
    # copy-on-write semantics.
    merged["p1"] = merged["p1"].fillna(0.10)
    merged["p0"] = merged["p0"].fillna(0.10)
    est = np.where(merged.o4_score == 1, 1 - merged.p1, merged.p0)
    # merge() produced a fresh RangeIndex, so re-attach full_df's index positionally.
    return pd.Series(est, index=full_df.index, name="o3_est")
+ sample_df.to_parquet(out_dir / "o3_raw.parquet", index=False) + spent = sample_df["o3_score"].notna().sum() * COST_PER_RECORD + LOGGER.info(f"o3 grading finished. Cost ≈ ${spent:.2f}") + else: + LOGGER.info("--run-o3 not provided; skipping API calls and downstream calibration") + return # exit early + + # 3 Calibration + err_df = compute_error_rates(sample_df) + full_df["o3_est"] = apply_calibration(full_df, err_df) + + # 4 Aggregate per model + agg_rows = [] + for model_id, grp in full_df.groupby("model_id", observed=True): + mean_est = grp.o3_est.mean() + n = len(grp) + k_hat = mean_est * n + ci_low, ci_high = wilson_ci(k_hat, n) + agg_rows.append({ + "model_id": model_id, + "mean": mean_est, + "ci_low": ci_low, + "ci_high": ci_high, + }) + agg_df = pd.DataFrame(agg_rows) + agg_df.to_csv(out_dir / "calibrated_o3_scores.csv", index=False) + LOGGER.info("Calibration finished. Artefacts saved to %s", out_dir) + + +if __name__ == "__main__": + try: + main() + except Exception as exc: + LOGGER.error("Fatal error: %s", exc) + raise
\ No newline at end of file diff --git a/putnam-bench-anon/docs/openrouter_usage.md b/putnam-bench-anon/docs/openrouter_usage.md new file mode 100644 index 0000000..85e5712 --- /dev/null +++ b/putnam-bench-anon/docs/openrouter_usage.md @@ -0,0 +1,143 @@ +# OpenRouter Integration Guide + +## Overview + +OpenRouter provides access to multiple AI model providers through a single API endpoint. This integration allows you to use models from OpenAI, Anthropic, Google, Meta, Mistral, and many other providers with a unified interface. + +## Setup + +### 1. Get an API Key + +Sign up at [OpenRouter.ai](https://openrouter.ai) to get your API key. + +### 2. Set Environment Variable + +```bash +export OPENROUTER_API_KEY="your-api-key-here" +``` + +## Usage + +### Basic Usage + +```python +from loader import create_loader + +# Create OpenRouter loader with default models +loader = create_loader("openrouter") + +# Or specify custom models +loader = create_loader( + "openrouter", + solver_model="openai/gpt-4o", + grader_model="anthropic/claude-3-opus" +) +``` + +### Direct Instantiation + +```python +from loader import OpenRouterModelLoader + +loader = OpenRouterModelLoader( + solver_model="openai/gpt-4o-mini", + grader_model="openai/gpt-4o", + api_key="your-api-key", # Optional, uses env var if not provided + site_url="https://yoursite.com", # Optional, for rankings + site_name="Your Site Name" # Optional, for rankings +) +``` + +### Using Cross-Provider Features + +One of the key advantages of OpenRouter is the ability to mix models from different providers: + +```python +# Use OpenAI for solving and Anthropic for grading +loader = create_loader( + "openrouter", + solver_model="openai/gpt-4o", + grader_model="anthropic/claude-3-opus" +) + +# Use Google's Gemini for solving and Meta's Llama for grading +loader = create_loader( + "openrouter", + solver_model="google/gemini-pro", + grader_model="meta-llama/llama-3-70b-instruct" +) +``` + +## Available Models + +OpenRouter supports 
a wide variety of models. Here are some popular options: + +### OpenAI Models +- `openai/gpt-4o` - GPT-4 Optimized +- `openai/gpt-4o-mini` - Smaller, faster GPT-4 +- `openai/gpt-4-turbo` - GPT-4 Turbo +- `openai/gpt-3.5-turbo` - GPT-3.5 Turbo +- `openai/o1-preview` - O1 Preview (reasoning model) +- `openai/o1-mini` - O1 Mini + +### Anthropic Models +- `anthropic/claude-3-opus` - Most capable Claude model +- `anthropic/claude-3-sonnet` - Balanced performance +- `anthropic/claude-3-haiku` - Fastest Claude model +- `anthropic/claude-2.1` - Previous generation +- `anthropic/claude-2` - Previous generation + +### Google Models +- `google/gemini-pro` - Gemini Pro +- `google/gemini-pro-vision` - Gemini Pro with vision +- `google/palm-2-codechat-bison` - PaLM 2 for code +- `google/palm-2-chat-bison` - PaLM 2 for chat + +### Meta Models +- `meta-llama/llama-3-70b-instruct` - Llama 3 70B +- `meta-llama/llama-3-8b-instruct` - Llama 3 8B +- `meta-llama/codellama-70b-instruct` - CodeLlama 70B + +### Mistral Models +- `mistralai/mistral-large` - Mistral Large +- `mistralai/mistral-medium` - Mistral Medium +- `mistralai/mistral-small` - Mistral Small +- `mistralai/mixtral-8x7b-instruct` - Mixtral MoE + +### Other Models +- `cohere/command-r-plus` - Cohere Command R+ +- `deepseek/deepseek-coder` - DeepSeek Coder +- `qwen/qwen-2-72b-instruct` - Qwen 2 72B + +For a complete and up-to-date list, visit: https://openrouter.ai/models + +## Testing + +Run the test script to verify your setup: + +```bash +python test_openrouter.py +``` + +## Cost Considerations + +Different models have different pricing. Generally: +- Mini/small models are cheapest +- Standard models are moderately priced +- Large/opus models are most expensive + +Check [OpenRouter pricing](https://openrouter.ai/models) for current rates. 
+ +## Troubleshooting + +### API Key Issues +- Ensure `OPENROUTER_API_KEY` is set correctly +- Check that your API key has sufficient credits + +### Model Availability +- Some models may have limited availability +- Check OpenRouter status page if a model isn't responding + +### Rate Limits +- OpenRouter has rate limits that vary by model +- The loader includes automatic retry logic for rate limit errors
async def solve_with_openrouter():
    """Example: solve one Putnam problem via OpenRouter with several model mixes.

    Early-returns (printing a hint) when OPENROUTER_API_KEY is unset or the
    sample dataset file is missing; otherwise runs solve + (for proofs) grade
    for each provider combination, printing progress as it goes.
    """
    # Check API key
    if not os.getenv('OPENROUTER_API_KEY'):
        print("❌ Please set OPENROUTER_API_KEY environment variable")
        return

    # Load a sample problem
    problem_file = "dataset/1938-A-1.json"
    if not os.path.exists(problem_file):
        print(f"❌ Problem file not found: {problem_file}")
        print("   Make sure you're running from the project root directory")
        return

    # JSON is UTF-8 by spec; don't depend on the platform's locale encoding.
    with open(problem_file, encoding="utf-8") as f:
        problem_data = json.load(f)

    print(f"📚 Problem: {problem_data['problem_statement'][:100]}...")
    print(f"   Type: {problem_data['problem_type']}")
    print(f"   Year: {problem_data['year']}")

    # Cross-provider combinations exercised by the example.
    test_configs = [
        {
            "name": "OpenAI Only",
            "solver": "openai/gpt-4o-mini",
            "grader": "openai/gpt-4o"
        },
        {
            "name": "Mixed OpenAI/Anthropic",
            "solver": "openai/gpt-4o",
            "grader": "anthropic/claude-3-haiku"
        },
        {
            "name": "Google Gemini",
            "solver": "google/gemini-pro",
            "grader": "google/gemini-pro"
        }
    ]

    for config in test_configs:
        print(f"\n{'='*60}")
        print(f"🧪 Testing: {config['name']}")
        print(f"   Solver: {config['solver']}")
        print(f"   Grader: {config['grader']}")

        try:
            # Create loader with specific models
            loader = create_loader(
                "openrouter",
                solver_model=config['solver'],
                grader_model=config['grader'],
                retries=3,
                timeout_base=120
            )

            # Solve the problem
            print("\n⏳ Solving problem...")
            solution, raw = await loader.solve_problem(problem_data['problem_statement'])

            if solution:
                print("✅ Solution found!")
                print(f"   Final answer: {solution.get('final_answer', 'N/A')}")

                # Grade the solution (if it's a proof problem)
                if problem_data['problem_type'] == 'proof':
                    print("\n⏳ Grading solution...")
                    # NOTE(review): other call sites unpack grade_solution() as a
                    # (dict, raw) tuple — confirm this single-value usage against
                    # the loader's actual return shape.
                    grade_result = await loader.grade_solution(
                        problem_data['problem_statement'],
                        solution['solution'],
                        problem_data.get('ground_truth_solution', ''),
                        problem_type='proof'
                    )

                    if grade_result:
                        print(f"📊 Grade: {grade_result.get('score', 'N/A')}/10")
                        print(f"   Reasoning: {grade_result.get('reasoning', 'N/A')[:100]}...")
                else:
                    print("   (Calculation problem - grading skipped)")
            else:
                print("❌ Failed to get solution")

        except Exception as e:
            print(f"❌ Error: {type(e).__name__}: {e}")

    print(f"\n{'='*60}")
    print("✅ Example completed!")
"deepseek/deepseek-coder", + "grader": "meta-llama/codellama-70b-instruct", + "notes": "Optimized for problems with code" + } + ] + + for rec in recommendations: + print(f"🎯 {rec['use_case']}") + print(f" Solver: {rec['solver']}") + print(f" Grader: {rec['grader']}") + print(f" Notes: {rec['notes']}") + print() + +if __name__ == "__main__": + print("🚀 OpenRouter Example for Putnam Bench") + + # Run the example + asyncio.run(solve_with_openrouter()) + + # Show recommendations + asyncio.run(list_recommended_models())
def check_python():
    """Verify the running interpreter is Python 3.8 or newer; report the result."""
    info = sys.version_info
    current = f"{info.major}.{info.minor}.{info.micro}"
    # Tuple comparison covers both "major too old" and "3.x with minor < 8".
    if (info.major, info.minor) >= (3, 8):
        print(f"✅ Python {current}")
        return True
    print("❌ Python 3.8+ required")
    print(f"   Current version: {current}")
    return False
def get_missing_packages(include_optional: bool = True) -> list:
    """Return the names of packages that fail an import check.

    Core provider SDKs are always checked; heavyweight local-model packages
    (transformers/torch/vllm) only when include_optional is True. Missing
    optional packages are included in the returned list as well.
    """
    groups = [
        ("🔍 Checking required packages...",
         ['openai', 'anthropic', 'google-generativeai', 'psutil'],
         "   ❌ {}"),
    ]
    if include_optional:
        groups.append(
            ("\n🔍 Checking optional packages (for local models)...",
             ['transformers', 'torch', 'vllm'],
             "   ⚠️  {} (optional)")
        )

    missing = []
    for header, packages, miss_fmt in groups:
        print(header)
        for pkg in packages:
            if check_package_installed(pkg):
                print(f"   ✅ {pkg}")
            else:
                print(miss_fmt.format(pkg))
                missing.append(pkg)
    return missing
def print_next_steps():
    """Print the post-install checklist of commands and available scripts."""
    # Data-driven: one tuple of lines, one print loop.
    lines = (
        "\n🎉 Installation completed!",
        "\n📚 Next Steps:",
        "   1. Set up API keys: python setup_config.py",
        "   2. Check health: python health_check.py",
        "   3. Quick test: python putnam_cli.py test --provider openai",
        "   4. Solve a problem: python putnam_cli.py solve dataset/1938-A-1.json",
        "   5. Run benchmark: python benchmark.py --quick-test",
        "\n💡 Available Scripts:",
        "   • putnam_cli.py - Main CLI interface",
        "   • health_check.py - Check provider health",
        "   • batch_evaluate.py - Batch evaluation",
        "   • benchmark.py - Performance comparison",
        "   • setup_config.py - Configuration management",
        "   • local_models_example.py - Local model examples",
        "\n📖 Documentation: See README.md for detailed usage",
    )
    for line in lines:
        print(line)
not check_python(): + return 1 + + # Check packages + missing_packages = get_missing_packages(include_optional=not args.quick) + + if args.check_only: + if missing_packages: + print(f"\n📋 Would install: {', '.join(missing_packages)}") + else: + print("\n✅ All packages are already installed!") + return 0 + + # Install packages + if missing_packages: + print(f"\n📦 Installing {len(missing_packages)} packages...") + if not install_packages(missing_packages): + return 1 + else: + print("\n✅ All packages are already installed!") + + # Create alias + if not args.no_alias: + print("\n🔗 Creating command alias...") + create_alias() + + # Run configuration setup + if not args.no_setup: + if input("\n🛠️ Run configuration setup now? (y/n): ").lower().startswith('y'): + await run_setup() + else: + print("💡 You can run setup later with: python setup_config.py") + + # Print next steps + print_next_steps() + + return 0 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
def create_loader(provider: str, **kwargs) -> ModelLoader:
    """
    Factory function to create model loaders.

    Args:
        provider: Provider name ("openai", "anthropic", "gemini", "xai",
            "openrouter", "kimi", "vllm", "vllm_direct", "huggingface"/"hf");
            case-insensitive.
        **kwargs: Forwarded verbatim to the loader constructor
            (e.g. solver_model=..., grader_model=..., retries=..., timeout_base=...).

    Returns:
        ModelLoader instance for the requested provider.

    Raises:
        ValueError: If provider is not supported.
    """
    # Thunks defer evaluation of each loader class until its branch is
    # actually taken, mirroring the behavior of an if/elif dispatch.
    factories = {
        "openai": lambda: OpenAIModelLoader(**kwargs),
        "anthropic": lambda: AnthropicModelLoader(**kwargs),
        "gemini": lambda: GeminiModelLoader(**kwargs),
        "xai": lambda: XAIModelLoader(**kwargs),
        "openrouter": lambda: OpenRouterModelLoader(**kwargs),
        "kimi": lambda: KimiModelLoader(**kwargs),
        "vllm": lambda: VLLMModelLoader(**kwargs),
        "vllm_direct": lambda: VLLMDirectModelLoader(**kwargs),
        "huggingface": lambda: HuggingFaceModelLoader(**kwargs),
        "hf": lambda: HuggingFaceModelLoader(**kwargs),
    }

    provider_lower = provider.lower()
    if provider_lower not in factories:
        supported = ["openai", "anthropic", "gemini", "xai", "openrouter", "kimi", "vllm", "vllm_direct", "huggingface"]
        raise ValueError(f"Unsupported provider: {provider}. Supported providers: {supported}")
    return factories[provider_lower]()
+ + Args: + solver_provider: Provider for solving problems + grader_provider: Provider for grading (if None, uses solver_provider) + solver_model: Override solver model + grader_model: Override grader model + **kwargs: Additional arguments (can include provider-specific settings) + + Returns: + CrossProviderLoader instance + + Examples: + # Use Kimi for solving and OpenAI for grading + loader = create_cross_provider_loader( + solver_provider="kimi", + grader_provider="openai", + solver_model="Kimi-K2-Instruct", + grader_model="o3" + ) + + # Use same provider but different models + loader = create_cross_provider_loader( + solver_provider="openai", + solver_model="gpt-4o-mini", + grader_model="o3" + ) + """ + # Extract provider-specific kwargs + solver_kwargs = kwargs.pop('solver_kwargs', {}) + grader_kwargs = kwargs.pop('grader_kwargs', {}) + + # Extract common parameters that should be passed to both loaders + quick = kwargs.pop('quick', False) + debug = kwargs.pop('debug', False) + + # Add common parameters to both solver and grader kwargs + solver_kwargs.update({'quick': quick, 'debug': debug}) + grader_kwargs.update({'quick': quick, 'debug': debug}) + + # Get default models if not specified + if not solver_model: + solver_defaults = get_default_models(solver_provider) + solver_model = solver_defaults['solver_model'] + + if not grader_provider: + grader_provider = solver_provider + + if not grader_model: + grader_defaults = get_default_models(grader_provider) + grader_model = grader_defaults['grader_model'] + + # Create solver loader + solver_loader = create_loader( + solver_provider, + solver_model=solver_model, + grader_model=solver_model, # Use solver model for both in solver loader + **solver_kwargs + ) + + # Create grader loader if different provider + if grader_provider != solver_provider: + grader_loader = create_loader( + grader_provider, + solver_model=grader_model, # Use grader model for both in grader loader + grader_model=grader_model, + 
def get_default_models(provider: str) -> dict[str, str]:
    """
    Get default model names for a provider.

    Args:
        provider: Provider name (case-insensitive)

    Returns:
        Dictionary with default solver_model and grader_model

    Raises:
        ValueError: If no defaults exist for the provider
    """
    # (solver, grader) pairs, expanded into the dict shape on the way out.
    table = {
        "openai": ("gpt-4o-mini", "o3"),
        "anthropic": ("claude-3-5-haiku-20241022", "claude-3-5-sonnet-20241022"),
        "gemini": ("gemini-1.5-flash", "gemini-1.5-pro"),
        "xai": ("grok-3", "grok-3"),
        "openrouter": ("openai/gpt-4o", "openai/gpt-4o"),
        "kimi": ("moonshot-v1-8k", "moonshot-v1-8k"),
        "vllm": ("meta-llama/Llama-3.2-3B-Instruct", "meta-llama/Llama-3.2-8B-Instruct"),
        "vllm_direct": ("gpt2", "gpt2"),
        "huggingface": ("microsoft/DialoGPT-medium", "microsoft/DialoGPT-large"),
    }

    key = provider.lower()
    if key not in table:
        raise ValueError(f"No defaults available for provider: {provider}")

    solver, grader = table[key]
    return {"solver_model": solver, "grader_model": grader}
and functions +__all__ = [ + # Main classes + "ModelLoader", + "OpenAIModelLoader", + "AnthropicModelLoader", + "GeminiModelLoader", + "XAIModelLoader", + "OpenRouterModelLoader", + "KimiModelLoader", + "VLLMModelLoader", + "VLLMDirectModelLoader", + "HuggingFaceModelLoader", + "CrossProviderLoader", + + # Factory functions + "create_loader", + "create_cross_provider_loader", + "get_supported_providers", + "get_default_models", + + # Prompts (for advanced users) + "SOLVER_SYSTEM_PROMPT", + "SOLVER_USER_TEMPLATE", + "PROOF_GRADER_SYSTEM_PROMPT", + "CALCULATION_GRADER_SYSTEM_PROMPT", + "PROOF_GRADER_USER_TEMPLATE", + "CALCULATION_GRADER_USER_TEMPLATE", + + # Configuration constants + "RESPONSE_FORMAT", + "DEFAULT_RETRIES", + "DEFAULT_TIMEOUT_BASE" +] diff --git a/putnam-bench-anon/loader/anthropic_client.py b/putnam-bench-anon/loader/anthropic_client.py new file mode 100644 index 0000000..e81f220 --- /dev/null +++ b/putnam-bench-anon/loader/anthropic_client.py @@ -0,0 +1,227 @@ +""" +Anthropic model loader implementation. +Handles API calls to Anthropic Claude models with proper error handling and retry logic. +""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional + +try: + from anthropic import AsyncAnthropic, RateLimitError, APIError, APIConnectionError +except ImportError: + AsyncAnthropic = None + RateLimitError = Exception + APIError = Exception + APIConnectionError = Exception + +from .base import ModelLoader +from .prompts import RESPONSE_FORMAT + + +class AnthropicModelLoader(ModelLoader): + """Anthropic implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "claude-3-5-haiku-20241022", + grader_model: str = "claude-3-5-sonnet-20241022", + api_key: Optional[str] = None, + base_url: Optional[str] = None, + **kwargs): + """ + Initialize Anthropic model loader. 
+ + Args: + solver_model: Anthropic model for solving problems (default: claude-3-5-haiku) + grader_model: Anthropic model for grading solutions (default: claude-3-5-sonnet) + api_key: Anthropic API key (if None, uses environment variable) + base_url: Custom base URL for Anthropic API + **kwargs: Additional arguments passed to parent class + """ + if AsyncAnthropic is None: + raise ImportError( + "anthropic package is required for AnthropicModelLoader. " + "Install with: pip install anthropic" + ) + + super().__init__(solver_model, grader_model, **kwargs) + + # Initialize Anthropic client + client_kwargs = {} + if api_key: + client_kwargs["api_key"] = api_key + if base_url: + client_kwargs["base_url"] = base_url + + self.client = AsyncAnthropic(**client_kwargs) + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to Anthropic. + + Args: + model: Anthropic model name + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Convert OpenAI format to Anthropic format + system_message = None + user_messages = [] + + for msg in messages: + if msg["role"] == "system": + system_message = msg["content"] + else: + user_messages.append(msg) + + # Prepare API call parameters + api_params = { + "model": model, + "messages": user_messages, + "max_tokens": 4000, # Anthropic requires max_tokens + "temperature": temperature, + } + + if system_message: + api_params["system"] = system_message + + # Make the API call + response = await self.client.messages.create(**api_params) + + # Extract response content + content = "" + if response.content: + for block in response.content: + if hasattr(block, 'text'): + content += block.text + + return content, content + + except RateLimitError as e: + # Handle rate limiting with special logic + error_str = str(e) + print(f"🚫 
RateLimitError: {error_str}") + + if "insufficient_quota" in error_str.lower(): + print("⏳ Detected quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) # 15 minutes + else: + # Standard rate limit - shorter sleep + sleep_time = 2 + random.random() + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except (APIError, APIConnectionError) as e: + print(f"❌ Anthropic API Error: {str(e)}") + raise + + except Exception as e: + print(f"❌ Unexpected error in Anthropic API call: {str(e)}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "anthropic" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify API connectivity. + + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.0 + ) + + if result and "ok" in result.lower(): + print(f"✅ Anthropic API health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ Anthropic API health check returned unexpected response") + return False + + except Exception as e: + print(f"❌ Anthropic API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Anthropic pricing (update with actual Anthropic pricing) + # These are rough estimates and should be updated with current pricing + pricing = { + "claude-3-5-haiku-20241022": {"input": 0.0008, "output": 0.004}, # per 1K tokens + "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015}, # per 1K tokens + "claude-3-opus-20240229": {"input": 0.015, "output": 0.075}, # per 1K tokens + "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125}, # per 1K tokens + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "claude-3-5-sonnet-20241022" # Default fallback + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + self.solver_model, + tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 3 # Assume output is ~33% of input + ) + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "USD" + } diff --git a/putnam-bench-anon/loader/base.py b/putnam-bench-anon/loader/base.py new file mode 100644 index 0000000..5e24a8f --- 
def __init__(self,
             solver_model: str,
             grader_model: str,
             retries: int = DEFAULT_RETRIES,
             timeout_base: int = DEFAULT_TIMEOUT_BASE,
             debug: bool = False,
             quick: bool = False):
    """
    Configure the loader's model names and retry/timeout policy.

    Args:
        solver_model: Model name for solving problems
        grader_model: Model name for grading solutions
        retries: Number of retry attempts for API calls
        timeout_base: Base timeout in seconds for API calls
        debug: Enable debug logging for JSON parsing
        quick: Quick mode - allows one retry with 1200s timeout each attempt
    """
    self.solver_model = solver_model
    self.grader_model = grader_model
    self.debug = debug
    self.quick = quick
    # Quick mode overrides the caller-supplied retry/timeout settings:
    # a single attempt with a fixed 20-minute budget.
    self.retries = 1 if quick else retries
    self.timeout_base = 1200 if quick else timeout_base
def parse_json_response(self, raw: str, debug: bool = False) -> Optional[Dict]:
    """
    Parse JSON out of an LLM response, falling back through several
    recovery strategies for malformed or truncated output.

    Strategies, in order:
      1. Direct json.loads.
      2. Regex-extract the outermost {...} span and parse that.
      3. Un-escape quotes/backslashes and re-escape raw control
         characters, then parse (models often emit literal newlines
         inside JSON strings).
      4. Recover a truncated object (opens with '{' but never closes):
         pull out the "solution" string and synthesize a result dict.
      5. Manual regex extraction of "solution" / "final_answer".

    Args:
        raw: Raw model output.
        debug: Print a diagnostic line for each failed strategy.

    Returns:
        The parsed dict, or None when every strategy fails.
    """
    # CLEANUP vs. original: removed an unused `import ast`, function-local
    # `import re` statements that shadowed the module-level import, and
    # dead locals in the truncation branch. Behavior is unchanged.
    if not raw:
        return None

    # 1. Direct JSON parse
    try:
        return json.loads(raw)
    except Exception as e:
        if debug:
            print(f"⚠️ Direct JSON parse failed: {e}")

    # 2. Extract the outermost brace-delimited span (greedy: first '{'
    # to last '}') and try parsing just that.
    match = re.search(r"\{[\s\S]*\}", raw)
    if match:
        try:
            return json.loads(match.group(0))
        except Exception as e:
            if debug:
                print(f"⚠️ Regex JSON parse failed: {e}")

    # 3. Fix common escaping issues, including raw control characters
    # inside JSON strings.
    try:
        fixed = raw.replace('\\"', '"').replace('\\\\', '\\')
        if fixed.strip().startswith('{') and fixed.strip().endswith('}'):
            try:
                fixed = fixed.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
                return json.loads(fixed)
            except Exception as e:
                if debug:
                    print(f"⚠️ Fixed JSON parse failed: {e}")
    except Exception as e:
        if debug:
            print(f"⚠️ JSON fixing failed: {e}")

    # 4. Recover a truncated JSON object that carries a "solution" key.
    try:
        if raw.strip().startswith('{') and not raw.strip().endswith('}'):
            if debug:
                print("🔧 Attempting to fix truncated JSON...")
            if '"solution"' in raw:
                solution_match = re.search(r'"solution":\s*"([^"]*(?:\\"[^"]*)*)', raw, re.DOTALL)
                if solution_match:
                    solution_text = solution_match.group(1).replace('\\"', '"').replace('\\n', '\n')
                    if debug:
                        print(f"🔧 Extracted solution from truncated JSON ({len(solution_text)} chars)")
                    return {
                        "solution": solution_text,
                        "final_answer": "Solution was truncated - see solution field for complete answer"
                    }
    except Exception as e:
        if debug:
            print(f"⚠️ Truncated JSON recovery failed: {e}")

    # 5. Final fallback: manual key-value extraction.
    try:
        if '"solution"' in raw:
            if debug:
                print("🔧 Attempting manual key-value extraction...")

            solution_match = re.search(r'"solution":\s*"([^"]*(?:\\"[^"]*)*)', raw, re.DOTALL)
            solution = solution_match.group(1) if solution_match else ""

            answer_match = re.search(r'"final_answer":\s*"([^"]*)"', raw)
            final_answer = answer_match.group(1) if answer_match else ""

            if solution:
                solution = solution.replace('\\"', '"').replace('\\n', '\n')
                if debug:
                    print(f"🔧 Manual extraction successful ({len(solution)} chars solution)")
                return {
                    "solution": solution,
                    "final_answer": final_answer if final_answer else "See solution field for complete answer"
                }
    except Exception as e:
        if debug:
            print(f"⚠️ Manual extraction failed: {e}")

    if debug:
        print("❌ All JSON parsing strategies failed")
    return None

def to_str(self, x) -> str:
    """Convert None/str/list/tuple/other values to a plain string.

    None → "", lists/tuples → newline-joined str() of elements,
    everything else → str(x).
    """
    if x is None:
        return ""
    if isinstance(x, str):
        return x
    if isinstance(x, (list, tuple)):
        return "\n".join(map(str, x))
    return str(x)
async def call_api_with_retry(self,
                              model: str,
                              messages: List[Dict[str, str]],
                              temperature: float = 0.0) -> Tuple[Optional[Dict], str]:
    """
    Make API call with retry logic and JSON parsing.

    Quick mode (self.quick): up to 2 attempts, fixed timeout, 5s pause
    between them. Regular mode: self.retries attempts with a 1.5x
    geometric timeout ramp (capped at 30 min) and long fixed backoffs
    between attempts (600s/900s/900s/1200s... plus jitter).

    Args:
        model: Model name to use
        messages: List of messages in chat format
        temperature: Temperature for generation

    Returns:
        Tuple of (parsed_json_response, raw_response). On exhausted
        retries the first element is {"_max_retries_reached": True,
        "error": ...}; on an unparseable-but-nonempty response it is
        None (with the raw text preserved for callers).
    """
    raw_response = ""

    # In quick mode, we allow one retry with a fixed timeout.
    # NOTE: every path inside this loop returns, so the regular-mode
    # loop below is unreachable when self.quick is set.
    if self.quick:
        max_attempts = 2  # Allow one retry in quick mode
        if self.debug:
            print(f"⚡ Quick mode: Up to {max_attempts} attempts with {self.timeout_base}s timeout each")

        for attempt in range(1, max_attempts + 1):
            try:
                if attempt > 1 and self.debug:
                    print(f"🔄 Quick mode retry attempt {attempt}/{max_attempts}")

                parsed, raw_response = await asyncio.wait_for(
                    self._call_api(model, messages, temperature),
                    timeout=self.timeout_base
                )

                if parsed:
                    # Try to parse as JSON; a non-empty but unparseable
                    # response is NOT retried — it is returned as None.
                    debug_mode = getattr(self, 'debug', False)
                    json_parsed = self.parse_json_response(parsed, debug=debug_mode)
                    if json_parsed:
                        return json_parsed, raw_response
                    return None, raw_response
                else:
                    # Empty responses are treated as failures so they
                    # flow through the except/retry path below.
                    raise ValueError("Empty response from API")

            except Exception as e:
                error_type = type(e).__name__
                error_msg = str(e)
                print(f"❌ {error_type} in quick mode (attempt {attempt}/{max_attempts}): {error_msg}")

                # If this was the last attempt, mark as failed with the
                # sentinel dict callers check via "_max_retries_reached".
                if attempt == max_attempts:
                    return {"_max_retries_reached": True, "error": str(e)}, raw_response

                # Otherwise, wait a bit before retrying
                if self.debug:
                    print("⏳ Waiting 5 seconds before retry...")
                await asyncio.sleep(5)

    # Regular mode with retries
    for attempt in range(1, self.retries + 1):
        # More aggressive timeout scaling for persistent failures.
        # Cap timeout at 30 minutes to prevent extremely long waits.
        timeout = min(self.timeout_base * (1.5 ** (attempt - 1)), 1800)
        if self.debug:
            print(f"🔄 Attempt {attempt}/{self.retries} with timeout {timeout:.0f}s")
        try:
            parsed, raw_response = await asyncio.wait_for(
                self._call_api(model, messages, temperature),
                timeout=timeout
            )

            if parsed:
                # Try to parse as JSON (same non-retrying semantics as
                # the quick-mode branch above).
                debug_mode = getattr(self, 'debug', False)
                json_parsed = self.parse_json_response(parsed, debug=debug_mode)
                if json_parsed:
                    return json_parsed, raw_response
                return None, raw_response
            else:
                raise ValueError("Empty response from API")

        except Exception as e:
            error_type = type(e).__name__
            error_msg = str(e)

            # Only show detailed error info on first attempt or in debug mode
            if attempt == 1 or self.debug:
                print(f"❌ {error_type} (attempt {attempt}/{self.retries}): {error_msg}")

            if attempt == self.retries:
                print(f"🔥 All {self.retries} retry attempts exhausted for {error_type}")
                # Return a special marker for max retries reached
                return {"_max_retries_reached": True, "error": str(e)}, raw_response

            # Custom retry strategy: 600s -> 900s -> 900s -> 1200s...
            # (long waits sized for provider quota windows, not transient
            # network errors)
            if attempt == 1:
                base_sleep = 600  # 10 minutes
            elif attempt == 2 or attempt == 3:
                base_sleep = 900  # 15 minutes
            else:
                base_sleep = 1200  # 20 minutes

            # Add small random jitter to avoid synchronized retries
            jitter = random.uniform(0, 30)  # 0-30 seconds jitter
            sleep_time = base_sleep + jitter

            if self.debug:
                print(f" ⏰ Using custom backoff strategy: {base_sleep}s base + {jitter:.1f}s jitter")

            if self.debug:
                print(f" ⏰ Retrying in {sleep_time:.1f}s")
            await asyncio.sleep(sleep_time)

    return None, raw_response
async def solve_problem(self, problem_statement: str, model: Optional[str] = None) -> Tuple[Optional[Dict], str]:
    """
    Ask the solver model to solve a mathematical problem.

    Args:
        problem_statement: Problem statement
        model: Model name to use for solving (if None, uses default solver_model)

    Returns:
        Tuple of (solving result dictionary, raw response). The result
        dictionary contains: {"solution": "detailed solution",
        "final_answer": "final answer"}
    """
    solver_model = self.solver_model if model is None else model

    # o3 / o3-mini / o4-mini only accept temperature 1.0; every other
    # model is run deterministically at 0.0.
    reasoning_families = ('o3', 'o3-mini', 'o4-mini')
    temperature = 1.0 if any(family in solver_model.lower() for family in reasoning_families) else 0.0

    messages = [
        {"role": "system", "content": SOLVER_SYSTEM_PROMPT},
        {"role": "user", "content": SOLVER_USER_TEMPLATE.format(
            problem_statement=problem_statement
        )}
    ]

    return await self.call_api_with_retry(solver_model, messages, temperature=temperature)
async def grade_solution(self,
                         problem_statement: str,
                         solution: str,
                         reference_solution: str,
                         problem_type: str = "proof",
                         model: Optional[str] = None) -> Tuple[Optional[Dict], str]:
    """
    Grade a candidate solution against the reference.

    Strictness is selected by problem type: "calculation" uses the
    lenient grader prompts, anything else falls back to the strict
    proof grader.

    Args:
        problem_statement: Problem statement
        solution: Student solution
        reference_solution: Reference solution
        problem_type: "proof" (strict) or "calculation" (lenient)
        model: Model name to use for grading (if None, uses default grader_model)

    Returns:
        Tuple of (grading result dictionary, raw response). The result
        contains: {"grade": "CORRECT"/"INCORRECT", "detailed_feedback": "...", ...}
    """
    use_calculation = problem_type == "calculation"
    system_prompt = CALCULATION_GRADER_SYSTEM_PROMPT if use_calculation else PROOF_GRADER_SYSTEM_PROMPT
    user_template = CALCULATION_GRADER_USER_TEMPLATE if use_calculation else PROOF_GRADER_USER_TEMPLATE

    grader_model = self.grader_model if model is None else model

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_template.format(
            problem_statement=problem_statement,
            solution=solution,
            reference_solution=reference_solution
        )}
    ]

    # Temperature 1.0 for grading (as per original script for o3).
    return await self.call_api_with_retry(grader_model, messages, temperature=1.0)
async def test_single_problem(self,
                              data: Dict,
                              variant_type: str = "original",
                              solver_model: Optional[str] = None,
                              grader_model: Optional[str] = None) -> Dict:
    """
    Test complete workflow for single problem: solving + grading.

    Args:
        data: Problem data dictionary
        variant_type: Problem variant type ("original" or key names in variants)
        solver_model: Model name for solving (if None, uses default solver_model)
        grader_model: Model name for grading (if None, uses default grader_model)

    Returns:
        Test result dictionary. "status" is one of "completed",
        "skipped", "failed", "partial" or "error"; exhausted-retry
        outcomes are folded into "completed" with grade INCORRECT so
        downstream accuracy counts treat them as wrong answers.
    """
    index = data.get("index", "unknown")
    problem_type = data.get("problem_type", "proof")

    try:
        # Get problem and reference solution — either from the top-level
        # fields ("original") or from the named entry under "variants".
        if variant_type == "original":
            question = self.to_str(data.get("question", "")).strip()
            reference_solution = self.to_str(data.get("solution", "")).strip()
        else:
            variant = data.get("variants", {}).get(variant_type)
            if not variant:
                return {
                    "index": index,
                    "variant_type": variant_type,
                    "status": "skipped",
                    "reason": f"no_{variant_type}_variant"
                }
            question = self.to_str(variant.get("question", "")).strip()
            reference_solution = self.to_str(variant.get("solution", "")).strip()

        if not question or not reference_solution:
            return {
                "index": index,
                "variant_type": variant_type,
                "status": "skipped",
                "reason": "missing_fields"
            }

        result = {
            "index": index,
            "variant_type": variant_type,
            "problem_type": problem_type,
            "status": "completed",
            "solve": {},
            "grade": {}
        }

        # 1. Solve problem
        solve_result, solve_raw = await self.solve_problem(question, model=solver_model)

        # Check if max retries reached: scored as an incorrect answer,
        # not an infrastructure failure.
        if solve_result and solve_result.get("_max_retries_reached"):
            result["solve"]["status"] = "max_retries"
            result["solve"]["solution"] = "Failed to generate solution after maximum retry attempts"
            result["solve"]["final_answer"] = "No answer - max retries reached"
            result["grade"]["status"] = "auto_failed"
            result["grade"]["grade"] = "INCORRECT"
            result["grade"]["detailed_feedback"] = f"Automatically marked as incorrect due to reaching maximum retry limit ({self.retries} attempts)"
            result["grade"]["major_issues"] = "API call failed after all retry attempts"
            result["grade"]["final_answer_correct"] = False
            result["grade"]["reasoning_rigor_score"] = 0
            result["grade"]["overall_assessment"] = "Failed to generate solution"
            result["correct"] = False
            result["status"] = "completed"  # Mark as completed, not failed
            return result

        if not solve_result:
            result["solve"]["status"] = "failed"
            result["status"] = "failed"
            return result

        student_solution = self.to_str(solve_result.get("solution", "")).strip()
        final_answer = self.to_str(solve_result.get("final_answer", "")).strip()

        result["solve"]["status"] = "success"
        result["solve"]["solution"] = student_solution
        result["solve"]["final_answer"] = final_answer

        # 2. Grade solution
        grade_result, grade_raw = await self.grade_solution(
            question, student_solution, reference_solution, problem_type, model=grader_model
        )

        # Check if grading max retries reached — same INCORRECT folding
        # as the solving path above.
        if grade_result and grade_result.get("_max_retries_reached"):
            result["grade"]["status"] = "auto_failed"
            result["grade"]["grade"] = "INCORRECT"
            result["grade"]["detailed_feedback"] = f"Automatically marked as incorrect due to grading reaching maximum retry limit ({self.retries} attempts)"
            result["grade"]["major_issues"] = "Grading API call failed after all retry attempts"
            result["grade"]["final_answer_correct"] = False
            result["grade"]["reasoning_rigor_score"] = 0
            result["grade"]["overall_assessment"] = "Failed to grade solution"
            result["correct"] = False
            result["status"] = "completed"  # Mark as completed, not partial/failed
        elif not grade_result:
            # NOTE(review): this branch never sets result["correct"] —
            # consumers indexing that key on "partial" results will
            # KeyError; confirm whether that is intended.
            result["grade"]["status"] = "failed"
            result["status"] = "partial"  # solving succeeded but grading failed
        else:
            result["grade"]["status"] = "success"
            result["grade"]["grade"] = grade_result.get("grade", "UNKNOWN")
            result["grade"]["detailed_feedback"] = grade_result.get("detailed_feedback", "")
            result["grade"]["major_issues"] = grade_result.get("major_issues", "")
            result["grade"]["final_answer_correct"] = grade_result.get("final_answer_correct", False)
            result["grade"]["reasoning_rigor_score"] = grade_result.get("reasoning_rigor_score", 0)
            result["grade"]["overall_assessment"] = grade_result.get("overall_assessment", "")

            # Mark whether correct
            result["correct"] = grade_result.get("grade") == "CORRECT"

        return result

    except Exception as e:
        # Any unexpected error is captured in the result rather than
        # propagated, so a batch run survives individual failures.
        return {
            "index": index,
            "variant_type": variant_type,
            "status": "error",
            "error": str(e),
            "error_type": type(e).__name__
        }
"""
Cross-provider model loader implementation.
Allows using different providers for solving and grading tasks.
"""

from typing import Dict, Optional, Tuple, Any
from .base import ModelLoader


class CrossProviderLoader(ModelLoader):
    """Wrapper that allows using different providers for solving and grading.

    Delegates solver calls to one ModelLoader and grader calls to
    another, routing by model name inside _call_api.
    """

    def __init__(self,
                 solver_loader: ModelLoader,
                 grader_loader: Optional[ModelLoader] = None,
                 **kwargs):
        """
        Initialize cross-provider loader.

        Args:
            solver_loader: ModelLoader instance for solving problems
            grader_loader: ModelLoader instance for grading (if None, uses solver_loader)
            **kwargs: Additional arguments passed to parent class
        """
        # If no grader loader specified, use the solver loader for both
        self.solver_loader = solver_loader
        self.grader_loader = grader_loader or solver_loader

        # Initialize parent with combined model info: the wrapper's
        # solver_model/grader_model mirror the wrapped loaders'.
        super().__init__(
            solver_model=solver_loader.solver_model,
            grader_model=self.grader_loader.grader_model,
            **kwargs
        )

        # Track if we're using cross-provider (an explicitly passed,
        # distinct grader loader).
        self.is_cross_provider = grader_loader is not None and grader_loader != solver_loader

    async def _call_api(self,
                        model: str,
                        messages: list[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Route API calls to the appropriate provider based on the model.

        Args:
            model: Model name to use
            messages: List of messages in chat format
            temperature: Temperature for generation

        Returns:
            Tuple of (response_content, raw_response)

        Raises:
            ValueError: if the model matches neither wrapped loader.
        """
        # Determine which loader to use based on the model name.
        # NOTE(review): if solver and grader share a model name, the
        # solver loader wins — confirm that is acceptable.
        if model == self.solver_model:
            return await self.solver_loader._call_api(model, messages, temperature)
        elif model == self.grader_model:
            return await self.grader_loader._call_api(model, messages, temperature)
        else:
            # Fallback: try to determine based on which loader has the model
            if hasattr(self.solver_loader, 'solver_model') and model == self.solver_loader.solver_model:
                return await self.solver_loader._call_api(model, messages, temperature)
            elif hasattr(self.grader_loader, 'grader_model') and model == self.grader_loader.grader_model:
                return await self.grader_loader._call_api(model, messages, temperature)
            else:
                raise ValueError(f"Model {model} not found in either solver or grader loader")

    def get_model_info(self) -> Dict[str, Any]:
        """Get information about the configured models and providers."""
        solver_info = self.solver_loader.get_model_info()
        grader_info = self.grader_loader.get_model_info()

        return {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "solver_provider": solver_info.get("provider", "unknown"),
            "grader_provider": grader_info.get("provider", "unknown"),
            "is_cross_provider": self.is_cross_provider,
            "solver_info": solver_info,
            "grader_info": grader_info
        }

    async def health_check(self) -> bool:
        """
        Perform health checks on both providers.

        Returns:
            True if both providers are healthy, False otherwise
        """
        print("🔍 Checking solver provider health...")
        solver_health = await self.solver_loader.health_check()

        if self.is_cross_provider:
            print("🔍 Checking grader provider health...")
            grader_health = await self.grader_loader.health_check()
            return solver_health and grader_health
        else:
            return solver_health

    async def estimate_cost(self,
                            num_problems: int,
                            avg_problem_length: int = 1000,
                            avg_solution_length: int = 2000) -> Dict[str, float]:
        """
        Estimate costs for both providers.

        Args:
            num_problems: Number of problems to process
            avg_problem_length: Average length of problem statements in characters
            avg_solution_length: Average length of solutions in characters

        Returns:
            Dictionary with combined cost estimates. NOTE: the
            cross-provider result shape differs from the single-provider
            shape (provider/model metadata instead of cost_per_problem).
        """
        # Get solver costs
        solver_costs = await self.solver_loader.estimate_cost(
            num_problems, avg_problem_length, avg_solution_length
        )

        if self.is_cross_provider:
            # Get grader costs separately
            grader_costs = await self.grader_loader.estimate_cost(
                num_problems, avg_problem_length, avg_solution_length
            )

            # Combine costs: each side contributes only its own phase.
            return {
                "solver_cost": solver_costs.get("solve_cost", 0),
                "grader_cost": grader_costs.get("grade_cost", 0),
                "total_cost": solver_costs.get("solve_cost", 0) + grader_costs.get("grade_cost", 0),
                "solver_provider": self.solver_loader.get_model_info().get("provider"),
                "grader_provider": self.grader_loader.get_model_info().get("provider"),
                "solver_model": self.solver_model,
                "grader_model": self.grader_model,
                "num_problems": num_problems,
                "note": "Cross-provider costs combined"
            }
        else:
            # Single provider costs
            return solver_costs

    async def __aenter__(self):
        """Async context manager entry: enter wrapped loaders that support it."""
        if hasattr(self.solver_loader, '__aenter__'):
            await self.solver_loader.__aenter__()
        if self.is_cross_provider and hasattr(self.grader_loader, '__aenter__'):
            await self.grader_loader.__aenter__()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Async context manager exit: close wrapped loaders that support it."""
        if hasattr(self.solver_loader, '__aexit__'):
            await self.solver_loader.__aexit__(exc_type, exc_val, exc_tb)
        if self.is_cross_provider and hasattr(self.grader_loader, '__aexit__'):
            await self.grader_loader.__aexit__(exc_type, exc_val, exc_tb)
"""
Gemini model loader implementation.
Handles API calls to Google Gemini models with proper error handling and retry logic.
"""

import asyncio
import random
from typing import Dict, List, Tuple, Optional

# Optional dependency: imported lazily-safe so the module can be loaded
# (and the error surfaced in __init__) without google-generativeai.
try:
    import google.generativeai as genai
    from google.generativeai.types import generation_types
except ImportError:
    genai = None
    generation_types = None

from .base import ModelLoader
from .prompts import RESPONSE_FORMAT


class GeminiModelLoader(ModelLoader):
    """Gemini implementation of the ModelLoader."""

    def __init__(self,
                 solver_model: str = "gemini-1.5-flash",
                 grader_model: str = "gemini-1.5-pro",
                 api_key: Optional[str] = None,
                 **kwargs):
        """
        Initialize Gemini model loader.

        Args:
            solver_model: Gemini model for solving problems (default: gemini-1.5-flash)
            grader_model: Gemini model for grading solutions (default: gemini-1.5-pro)
            api_key: Google AI API key (if None, uses environment variable GOOGLE_API_KEY)
            **kwargs: Additional arguments passed to parent class

        Raises:
            ImportError: if google-generativeai is not installed.
        """
        if genai is None:
            raise ImportError(
                "google-generativeai package is required for GeminiModelLoader. "
                "Install with: pip install google-generativeai"
            )

        super().__init__(solver_model, grader_model, **kwargs)

        # Configure Google AI (process-global configuration)
        if api_key:
            genai.configure(api_key=api_key)
        else:
            # Will use GOOGLE_API_KEY environment variable
            genai.configure()

    async def _call_api(self,
                        model: str,
                        messages: List[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Make an API call to Gemini.

        Converts OpenAI-style messages to Gemini's format, requests JSON
        output, and runs the blocking SDK call in a worker thread.

        Args:
            model: Gemini model name
            messages: List of messages in chat format
            temperature: Temperature for generation

        Returns:
            Tuple of (response_content, raw_response) — both are the
            same text for this provider.
        """
        try:
            # Initialize the model
            model_instance = genai.GenerativeModel(model)

            # Convert OpenAI format to Gemini format ("assistant" maps
            # to Gemini's "model" role; system goes into the prompt).
            system_instruction = None
            conversation = []

            for msg in messages:
                if msg["role"] == "system":
                    system_instruction = msg["content"]
                elif msg["role"] == "user":
                    conversation.append({"role": "user", "parts": [msg["content"]]})
                elif msg["role"] == "assistant":
                    conversation.append({"role": "model", "parts": [msg["content"]]})

            # Configure generation parameters.
            # NOTE(review): max_output_tokens=4000 may truncate long
            # proofs — the base loader's truncated-JSON recovery is the
            # backstop; confirm this limit is intended.
            generation_config = genai.types.GenerationConfig(
                temperature=temperature,
                max_output_tokens=4000,
            )

            # Request JSON format for all Gemini models
            # Flash models now support JSON format as per latest API documentation
            generation_config.response_mime_type = "application/json"

            # Make the API call
            if system_instruction and len(conversation) == 1:
                # Single user message with system instruction folded in
                prompt = f"{system_instruction}\n\n{conversation[0]['parts'][0]}"
                response = await asyncio.to_thread(
                    model_instance.generate_content,
                    prompt,
                    generation_config=generation_config
                )
            else:
                # Multi-turn conversation
                if system_instruction:
                    # Prepend system instruction to first user message
                    if conversation and conversation[0]["role"] == "user":
                        conversation[0]["parts"][0] = f"{system_instruction}\n\n{conversation[0]['parts'][0]}"

                response = await asyncio.to_thread(
                    model_instance.generate_content,
                    conversation,
                    generation_config=generation_config
                )

            # Extract response content.
            # NOTE(review): per the SDK docs, accessing response.text can
            # raise when the candidate was blocked / has no valid Part;
            # that lands in the except below and triggers a retry —
            # confirm that is the desired handling.
            content = ""
            if response.text:
                content = response.text

            return content, content

        except Exception as e:
            error_str = str(e)

            # Handle different types of errors by message sniffing (the
            # SDK's exception types are not matched explicitly here).
            if "quota" in error_str.lower() or "rate" in error_str.lower():
                print(f"🚫 Rate/Quota Error: {error_str}")
                if "quota" in error_str.lower():
                    print("⏳ Detected quota exhaustion - sleeping 15 minutes")
                    await asyncio.sleep(900)  # 15 minutes
                else:
                    sleep_time = 2 + random.random()
                    print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s")
                    await asyncio.sleep(sleep_time)
                # Re-raise to trigger retry logic
                raise
            elif "api" in error_str.lower():
                print(f"❌ Gemini API Error: {error_str}")
                raise
            else:
                print(f"❌ Unexpected error in Gemini API call: {error_str}")
                raise

    def get_model_info(self) -> Dict[str, str]:
        """Get information about the configured models."""
        return {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "provider": "gemini"
        }

    async def health_check(self) -> bool:
        """
        Perform a simple health check to verify API connectivity.

        Returns:
            True if API is accessible, False otherwise
        """
        try:
            # Simple test call
            test_messages = [
                {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"}
            ]

            result, _ = await self._call_api(
                model=self.solver_model,
                messages=test_messages,
                temperature=0.0
            )

            if result and "ok" in result.lower():
                print(f"✅ Gemini API health check passed for {self.solver_model}")
                return True
            else:
                print(f"⚠️ Gemini API health check returned unexpected response")
                return False

        except Exception as e:
            print(f"❌ Gemini API health check failed: {str(e)}")
            return False

    async def estimate_cost(self,
                            num_problems: int,
                            avg_problem_length: int = 1000,
                            avg_solution_length: int = 2000) -> Dict[str, float]:
        """
        Estimate the cost for processing a given number of problems.

        Args:
            num_problems: Number of problems to process
            avg_problem_length: Average length of problem statements in characters
            avg_solution_length: Average length of solutions in characters

        Returns:
            Dictionary with cost estimates.

        NOTE(review): num_problems == 0 raises ZeroDivisionError on the
        cost_per_problem line — same latent issue as in the sibling
        provider loaders; guard if batch size can legitimately be zero.
        """
        # Rough token estimates (1 token ≈ 4 characters for English)
        tokens_per_solve = (avg_problem_length + avg_solution_length) // 4
        tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4

        # Gemini pricing (update with actual Google AI pricing)
        # These are rough estimates and should be updated with current pricing
        pricing = {
            "gemini-1.5-flash": {"input": 0.000075, "output": 0.0003},  # per 1K tokens
            "gemini-1.5-pro": {"input": 0.00125, "output": 0.005},  # per 1K tokens
            "gemini-1.0-pro": {"input": 0.0005, "output": 0.0015},  # per 1K tokens
        }

        def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float:
            if model not in pricing:
                model = "gemini-1.5-pro"  # Default fallback

            input_cost = (input_tokens / 1000) * pricing[model]["input"]
            output_cost = (output_tokens / 1000) * pricing[model]["output"]
            return input_cost + output_cost

        # Calculate costs
        solve_cost = get_model_cost(
            self.solver_model,
            tokens_per_solve * num_problems,
            tokens_per_solve * num_problems // 2  # Assume output is ~50% of input
        )

        grade_cost = get_model_cost(
            self.grader_model,
            tokens_per_grade * num_problems,
            tokens_per_grade * num_problems // 3  # Assume output is ~33% of input
        )

        total_cost = solve_cost + grade_cost

        return {
            "solve_cost": round(solve_cost, 4),
            "grade_cost": round(grade_cost, 4),
            "total_cost": round(total_cost, 4),
            "cost_per_problem": round(total_cost / num_problems, 6),
            "currency": "USD"
        }
"""
Hugging Face local model loader implementation.
Handles direct inference with locally loaded transformers models.
"""

import asyncio
import random
from typing import Dict, List, Tuple, Optional
import json

# Optional dependencies: keep the module importable without torch /
# transformers; __init__ raises a clear error if they are missing.
try:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    import transformers
except ImportError:
    torch = None
    AutoModelForCausalLM = None
    AutoTokenizer = None
    pipeline = None
    transformers = None

from .base import ModelLoader


class HuggingFaceModelLoader(ModelLoader):
    """Hugging Face local model implementation of the ModelLoader."""

    def __init__(self,
                 solver_model: str = "microsoft/DialoGPT-medium",
                 grader_model: str = "microsoft/DialoGPT-large",
                 device: str = "auto",
                 max_length: int = 4000,
                 **kwargs):
        """
        Initialize Hugging Face model loader.

        Args:
            solver_model: HuggingFace model name for solving problems
            grader_model: HuggingFace model name for grading solutions
            device: Device to run models on ("auto", "cuda", "cpu")
            max_length: Maximum generation length
            **kwargs: Additional arguments passed to parent class

        Raises:
            ImportError: if transformers/torch are not installed.
        """
        if transformers is None or torch is None:
            raise ImportError(
                "transformers and torch packages are required for HuggingFaceModelLoader. "
                "Install with: pip install transformers torch"
            )

        super().__init__(solver_model, grader_model, **kwargs)

        # Device setup: "auto" prefers CUDA when available
        if device == "auto":
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
        else:
            self.device = device

        self.max_length = max_length

        # Model and tokenizer caches, keyed by model name, so each
        # checkpoint is loaded at most once per loader instance.
        self._models = {}
        self._tokenizers = {}
        self._pipelines = {}

        print(f"🔧 HuggingFace loader initialized on device: {self.device}")

    async def _load_model(self, model_name: str) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
        """Load model and tokenizer, with caching.

        Heavy from_pretrained calls run in a worker thread so the event
        loop stays responsive. On CUDA the model is loaded in fp16 with
        device_map="auto"; on CPU in fp32 and moved explicitly.
        """
        if model_name not in self._models:
            print(f"📥 Loading model: {model_name}")

            try:
                # Load in a separate thread to avoid blocking
                tokenizer = await asyncio.to_thread(
                    AutoTokenizer.from_pretrained,
                    model_name,
                    trust_remote_code=True
                )

                model = await asyncio.to_thread(
                    AutoModelForCausalLM.from_pretrained,
                    model_name,
                    torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                    device_map="auto" if self.device == "cuda" else None,
                    trust_remote_code=True
                )

                if self.device == "cpu":
                    model = model.to(self.device)

                # Set pad token if not present (many causal LMs ship
                # without one; reuse EOS so generation can pad).
                if tokenizer.pad_token is None:
                    tokenizer.pad_token = tokenizer.eos_token

                self._models[model_name] = model
                self._tokenizers[model_name] = tokenizer

                print(f"✅ Model loaded successfully: {model_name}")

            except Exception as e:
                print(f"❌ Failed to load model {model_name}: {str(e)}")
                raise

        return self._models[model_name], self._tokenizers[model_name]
+ + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Load model and tokenizer + hf_model, tokenizer = await self._load_model(model) + + # Convert messages to prompt format + prompt = self._format_messages(messages) + + # Generate response + response = await self._generate_response( + hf_model, tokenizer, prompt, temperature + ) + + return response, response + + except Exception as e: + print(f"❌ HuggingFace inference error: {str(e)}") + raise + + def _format_messages(self, messages: List[Dict[str, str]]) -> str: + """Convert OpenAI message format to a prompt string.""" + prompt_parts = [] + + for msg in messages: + role = msg["role"] + content = msg["content"] + + if role == "system": + prompt_parts.append(f"System: {content}") + elif role == "user": + prompt_parts.append(f"User: {content}") + elif role == "assistant": + prompt_parts.append(f"Assistant: {content}") + + prompt_parts.append("Assistant:") + return "\n\n".join(prompt_parts) + + async def _generate_response(self, + model: AutoModelForCausalLM, + tokenizer: AutoTokenizer, + prompt: str, + temperature: float) -> str: + """Generate response using the loaded model.""" + + # Tokenize input + inputs = await asyncio.to_thread( + tokenizer.encode, + prompt, + return_tensors="pt", + truncation=True, + max_length=2048 # Leave room for generation + ) + + if self.device == "cuda": + inputs = inputs.to(self.device) + + # Generation parameters + gen_kwargs = { + "max_new_tokens": min(self.max_length, 2048), + "temperature": max(temperature, 0.1), # Avoid 0 temperature + "do_sample": temperature > 0.0, + "pad_token_id": tokenizer.eos_token_id, + "eos_token_id": tokenizer.eos_token_id, + "attention_mask": torch.ones_like(inputs) + } + + if temperature > 0.0: + gen_kwargs.update({ + "top_p": 0.9, + "top_k": 50 + }) + + # Generate + with torch.no_grad(): + outputs = await 
asyncio.to_thread( + model.generate, + inputs, + **gen_kwargs + ) + + # Decode response + generated_text = await asyncio.to_thread( + tokenizer.decode, + outputs[0][inputs.shape[1]:], # Only new tokens + skip_special_tokens=True + ) + + return generated_text.strip() + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "huggingface", + "device": self.device, + "loaded_models": list(self._models.keys()) + } + + async def health_check(self) -> bool: + """ + Perform a simple health check by testing model loading and inference. + + Returns: + True if models can be loaded and run, False otherwise + """ + try: + # Simple test + test_messages = [ + {"role": "user", "content": "Hello, please say 'ok' to confirm you're working."} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.1 + ) + + if result and len(result) > 0: + print(f"✅ HuggingFace health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ HuggingFace health check returned empty response") + return False + + except Exception as e: + print(f"❌ HuggingFace health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate computational cost for processing problems locally. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates (computational cost in arbitrary units) + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Model size-based cost estimation (FLOPS approximation) + model_costs = { + # Small models (< 1B parameters) + "gpt2": 0.5, + "distilgpt2": 0.3, + "dialogpt-small": 0.4, + "dialogpt-medium": 0.8, + + # Medium models (1B - 10B parameters) + "dialogpt-large": 1.5, + "gpt2-medium": 1.0, + "gpt2-large": 2.0, + "gpt2-xl": 4.0, + + # Large models (10B+ parameters) + "llama-7b": 8.0, + "llama-13b": 15.0, + "llama-30b": 35.0, + "llama-65b": 70.0, + } + + def get_model_cost(model: str) -> float: + model_lower = model.lower() + for key, cost in model_costs.items(): + if key in model_lower: + return cost + + # Default based on common model sizes + if any(size in model_lower for size in ["small", "mini"]): + return 0.5 + elif any(size in model_lower for size in ["medium", "base"]): + return 1.0 + elif any(size in model_lower for size in ["large", "xl"]): + return 2.0 + else: + return 1.5 # Default for unknown models + + # Calculate computational costs + solver_cost_factor = get_model_cost(self.solver_model) + grader_cost_factor = get_model_cost(self.grader_model) + + # Device multiplier (GPU is faster but uses more power) + device_multiplier = 0.3 if self.device == "cuda" else 1.0 + + solve_cost = tokens_per_solve * num_problems * solver_cost_factor * device_multiplier / 1000 + grade_cost = tokens_per_grade * num_problems * grader_cost_factor * device_multiplier / 1000 + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": 
round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "computational_units", + "device": self.device, + "note": "Local HuggingFace costs are computational (time/energy/memory)" + } + + async def unload_model(self, model_name: str) -> bool: + """ + Unload a specific model to free memory. + + Args: + model_name: Name of the model to unload + + Returns: + True if successfully unloaded, False otherwise + """ + try: + if model_name in self._models: + del self._models[model_name] + del self._tokenizers[model_name] + + # Force garbage collection + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + print(f"🗑️ Unloaded model: {model_name}") + return True + else: + print(f"⚠️ Model not loaded: {model_name}") + return False + + except Exception as e: + print(f"❌ Error unloading model {model_name}: {str(e)}") + return False + + async def unload_all_models(self) -> bool: + """ + Unload all models to free memory. + + Returns: + True if all models successfully unloaded + """ + try: + model_names = list(self._models.keys()) + success = True + + for model_name in model_names: + if not await self.unload_model(model_name): + success = False + + return success + + except Exception as e: + print(f"❌ Error unloading all models: {str(e)}") + return False diff --git a/putnam-bench-anon/loader/openai_client.py b/putnam-bench-anon/loader/openai_client.py new file mode 100644 index 0000000..fcbe247 --- /dev/null +++ b/putnam-bench-anon/loader/openai_client.py @@ -0,0 +1,603 @@ +""" +OpenAI model loader implementation. +Handles API calls to OpenAI models with proper error handling and retry logic. 
+""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional +import os # Added for KimiModelLoader + +from openai import AsyncOpenAI, RateLimitError, APIError, APIConnectionError, BadRequestError + +from .base import ModelLoader +from .prompts import RESPONSE_FORMAT + + +class OpenAIModelLoader(ModelLoader): + """OpenAI implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "gpt-4o-mini", + grader_model: str = "o3", + api_key: Optional[str] = None, + base_url: Optional[str] = None, + **kwargs): + """ + Initialize OpenAI model loader. + + Args: + solver_model: OpenAI model for solving problems (default: gpt-4o-mini) + grader_model: OpenAI model for grading solutions (default: o3) + api_key: OpenAI API key (if None, uses environment variable) + base_url: Custom base URL for OpenAI API + **kwargs: Additional arguments passed to parent class + """ + super().__init__(solver_model, grader_model, **kwargs) + + # Initialize OpenAI client with custom httpx client for high concurrency + client_kwargs = {} + if api_key: + client_kwargs["api_key"] = api_key + if base_url: + client_kwargs["base_url"] = base_url + + # Configure httpx for high concurrency + import httpx + limits = httpx.Limits( + max_connections=1000, # Total connection pool size + max_keepalive_connections=500, # Persistent connections + keepalive_expiry=30.0 # Keep connections alive for 30s + ) + timeout = httpx.Timeout( + timeout=600.0, # Overall timeout (increased from 300) + connect=60.0, # Connection timeout + read=600.0, # Read timeout (increased from 300) + write=60.0 # Write timeout + ) + + http_client = httpx.AsyncClient( + limits=limits, + timeout=timeout + ) + client_kwargs["http_client"] = http_client + + self.client = AsyncOpenAI(**client_kwargs) + self._http_client = http_client # Keep reference to close later + + async def __aenter__(self): + """Async context manager entry.""" + return self + + async def __aexit__(self, exc_type, exc_val, 
exc_tb): + """Async context manager exit - close http client.""" + if hasattr(self, '_http_client'): + await self._http_client.aclose() + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to OpenAI. + + Args: + model: OpenAI model name + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Override temperature for models that require it + # o1, o3, o3-mini, and o4-mini only support temperature 1.0 + if any(model_name in model.lower() for model_name in ['o1', 'o3', 'o3-mini', 'o4-mini']): + actual_temperature = 1.0 + if self.debug and temperature != 1.0: + print(f"⚠️ Overriding temperature from {temperature} to 1.0 for model {model}") + else: + actual_temperature = temperature + + # Prepare API call parameters + api_params = { + "model": model, + "messages": messages, + "temperature": actual_temperature, + # Set max_tokens to avoid truncation + # Most OpenAI models support at least 4096, newer ones support much more + "max_tokens": 32000, # High default that works for GPT-4 and newer models + } + + # Only add response_format for models that support it + # o1 models and some older models don't support JSON format + # Note: o3 and o3-mini DO support response_format (tested and confirmed) + if not (model.startswith("o1") or model in ["gpt-4", "gpt-3.5-turbo"]): + api_params["response_format"] = RESPONSE_FORMAT + + # Remove max_tokens for models that don't support it + # o1 and o3 models don't support max_tokens parameter + if model.startswith("o1") or model.startswith("o3"): + api_params.pop("max_tokens", None) + + # Make the API call + response = await self.client.chat.completions.create(**api_params) + + # Extract response content + content = response.choices[0].message.content or "" + + return content, content + + except RateLimitError as e: + # Handle 
rate limiting with special logic + error_str = str(e) + if self.debug: + print(f"🚫 RateLimitError: {error_str}") + + if "insufficient_quota" in error_str: + if self.debug: + print("⏳ Detected quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) # 15 minutes + else: + # Standard rate limit - shorter sleep + sleep_time = 2 + random.random() + if self.debug: + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except BadRequestError as e: + # Handle policy violations and other 400 errors with special logic + error_str = str(e) + if self.debug: + print(f"🚫 BadRequestError: {error_str}") + + if "usage policy" in error_str or "flagged" in error_str: + if self.debug: + print("⏳ Detected policy violation - sleeping 30 seconds before retry") + await asyncio.sleep(30) # Longer delay for policy violations + else: + # Standard bad request - shorter sleep + sleep_time = 5 + random.random() + if self.debug: + print(f" ⏰ Bad request error, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except (APIError, APIConnectionError) as e: + if self.debug: + print(f"❌ OpenAI API Error: {str(e)}") + raise + + except Exception as e: + if self.debug: + print(f"❌ Unexpected error in OpenAI API call: {str(e)}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "openai" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify API connectivity. 
+ + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + # Set temperature based on model + # o1, o3, o3-mini, and o4-mini require temperature 1.0 + if any(model_name in self.solver_model.lower() for model_name in ['o1', 'o3', 'o3-mini', 'o4-mini']): + temperature = 1.0 + else: + # Use temperature 0.0 for deterministic results with other models + temperature = 0.0 + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=temperature + ) + + if result and "ok" in result.lower(): + if self.debug: + print(f"✅ OpenAI API health check passed for {self.solver_model}") + return True + else: + if self.debug: + print(f"⚠️ OpenAI API health check returned unexpected response") + return False + + except Exception as e: + if self.debug: + print(f"❌ OpenAI API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Simplified pricing (update with actual OpenAI pricing) + # These are rough estimates and should be updated with current pricing + pricing = { + "gpt-4o-mini": {"input": 0.00015, "output": 0.0006}, # per 1K tokens + "o3": {"input": 0.03, "output": 0.12}, # per 1K tokens (estimated) + "gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "gpt-4" # Default fallback + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + self.solver_model, + tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 3 # Assume output is ~33% of input + ) + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "USD" + } + + +class KimiModelLoader(OpenAIModelLoader): + """Kimi/Moonshot implementation using OpenAI-compatible API.""" + + def __init__(self, + solver_model: str = "kimi-k2-0711-preview", + grader_model: str = "kimi-k2-0711-preview", + api_key: Optional[str] = None, + **kwargs): + """ + 
class KimiModelLoader(OpenAIModelLoader):
    """Kimi/Moonshot implementation using OpenAI-compatible API."""

    def __init__(self,
                 solver_model: str = "kimi-k2-0711-preview",
                 grader_model: str = "kimi-k2-0711-preview",
                 api_key: Optional[str] = None,
                 **kwargs):
        """
        Initialize Kimi model loader.

        Args:
            solver_model: Kimi model for solving problems
            grader_model: Kimi model for grading solutions
            api_key: Kimi API key (if None, uses MOONSHOT_API_KEY environment variable)
            **kwargs: Additional arguments passed to parent class
        """
        # Get API key from parameter or environment
        if api_key is None:
            api_key = os.getenv('MOONSHOT_API_KEY')

        # Initialize with Kimi-specific settings (Moonshot's
        # OpenAI-compatible endpoint).
        super().__init__(
            solver_model=solver_model,
            grader_model=grader_model,
            api_key=api_key,
            base_url="https://api.moonshot.ai/v1",
            **kwargs
        )

    async def _call_api(self,
                        model: str,
                        messages: List[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Make an API call to Kimi with proper error handling.

        Args:
            model: Kimi model name
            messages: List of messages in chat format
            temperature: Temperature for generation

        Returns:
            Tuple of (response_content, raw_response)

        Raises:
            RateLimitError / APIError / APIConnectionError / Exception:
                re-raised after logging so the caller's retry logic runs.
        """
        import time

        start_time = time.time()
        if self.debug:
            print(f"🔄 Starting Kimi API call with model: {model}")

        try:
            api_params = {
                "model": model,
                "messages": messages,
                "temperature": temperature,
                "response_format": RESPONSE_FORMAT,  # Kimi supports JSON format
            }

            # Set max_tokens based on the context window implied by the
            # model name suffix.
            if "128k" in model:
                api_params["max_tokens"] = 32000  # For 128k context models
            elif "32k" in model:
                api_params["max_tokens"] = 16000  # For 32k context models
            elif "8k" in model:
                api_params["max_tokens"] = 8000   # For 8k context models
            elif "k2" in model.lower():
                api_params["max_tokens"] = 24000  # For K2 models
            else:
                api_params["max_tokens"] = 16000  # Default high limit

            if self.debug:
                print(f"📋 API call parameters: model={model}, messages={len(messages)}, temp={temperature}, max_tokens={api_params['max_tokens']}")

            response = await self.client.chat.completions.create(**api_params)

            elapsed_time = time.time() - start_time
            if self.debug:
                print(f"✅ Kimi API call completed in {elapsed_time:.2f}s")

            content = response.choices[0].message.content or ""
            if self.debug:
                print(f"📄 Response length: {len(content)} characters")

            # Check if response might be truncated
            if self.debug and hasattr(response, 'usage'):
                completion_tokens = response.usage.completion_tokens
                print(f"📊 Completion tokens used: {completion_tokens}")
                if completion_tokens >= api_params['max_tokens'] * 0.95:  # 95% of limit
                    print(f"⚠️ WARNING: Response may be truncated (used {completion_tokens}/{api_params['max_tokens']} tokens)")

            # Heuristic truncation check: a complete JSON payload should end
            # with a closing brace. (Fixed: the original tuple repeated '"}'
            # twice, so the bare '}' alternative was never tested.)
            if self.debug and content and not content.strip().endswith(('"}', '}')):
                print("⚠️ WARNING: Response doesn't end with proper JSON closure - likely truncated")

            # ============= RAW RESPONSE LOGGING (DEBUG ONLY) =============
            if self.debug:
                import json
                from pathlib import Path
                from datetime import datetime

                # Create raw response log directory
                log_dir = Path("kimi_raw_responses")
                log_dir.mkdir(exist_ok=True)

                # Save raw response
                timestamp = datetime.now().strftime('%Y%m%d_%H%M%S_%f')[:-3]  # Include milliseconds
                raw_log_file = log_dir / f"kimi_raw_response_{timestamp}.json"

                raw_response_data = {
                    "timestamp": datetime.now().isoformat(),
                    "model": model,
                    "api_params": api_params,
                    "response_time_seconds": elapsed_time,
                    "raw_content": content,
                    "content_length": len(content),
                    "response_object": {
                        "choices": [
                            {
                                "message": {
                                    "content": content,
                                    "role": response.choices[0].message.role
                                }
                            }
                        ]
                    }
                }

                try:
                    with open(raw_log_file, 'w', encoding='utf-8') as f:
                        json.dump(raw_response_data, f, indent=2, ensure_ascii=False)
                    print(f"💾 Raw response saved to: {raw_log_file}")
                except Exception as save_error:
                    # Logging failures must not break the API call itself.
                    print(f"❌ Failed to save raw response: {save_error}")

                # Also print raw content to console
                print(f"📋 RAW RESPONSE CONTENT:")
                print(f"{'='*60}")
                print(content[:1000] + ("..." if len(content) > 1000 else ""))
                print(f"{'='*60}")
            # ============= END RAW RESPONSE LOGGING =============

            return content, content

        except RateLimitError as e:
            elapsed_time = time.time() - start_time
            error_str = str(e)
            if self.debug:
                print(f"🚫 Kimi RateLimitError after {elapsed_time:.2f}s: {error_str}")

            # Try to capture response details
            if self.debug and hasattr(e, 'response') and e.response:
                print(f"   Status: {e.response.status_code}")
                print(f"   Headers: {dict(e.response.headers)}")
                print(f"   Response: {e.response.text[:500]}...")

            if "insufficient_quota" in error_str:
                if self.debug:
                    print("⏳ Detected Kimi quota exhaustion - sleeping 15 minutes")
                await asyncio.sleep(900)  # 15 minutes
            else:
                # Standard rate limit - shorter sleep with jitter
                sleep_time = 2 + random.random()
                if self.debug:
                    print(f"   ⏰ Rate limited on Kimi API, sleeping {sleep_time:.1f}s")
                await asyncio.sleep(sleep_time)

            # Re-raise to trigger retry logic
            raise

        except (APIError, APIConnectionError) as e:
            elapsed_time = time.time() - start_time
            error_str = str(e)
            if self.debug:
                print(f"❌ Kimi API Error after {elapsed_time:.2f}s: {error_str}")

            # Try to capture response details
            if self.debug and hasattr(e, 'response') and e.response:
                print(f"   Status: {e.response.status_code}")
                print(f"   Headers: {dict(e.response.headers)}")
                print(f"   Response: {e.response.text[:500]}...")

            # Log request details for debugging
            if self.debug and hasattr(e, 'request') and e.request:
                print(f"   Request URL: {e.request.url}")
                print(f"   Request method: {e.request.method}")
                print(f"   Request headers: {dict(e.request.headers)}")

            raise

        except Exception as e:
            elapsed_time = time.time() - start_time
            error_str = str(e)
            if self.debug:
                print(f"❌ Unexpected error in Kimi API call after {elapsed_time:.2f}s: {error_str}")
                print(f"   Error type: {type(e).__name__}")

            # Try to capture any additional error details
            if self.debug and hasattr(e, 'response'):
                try:
                    print(f"   Response status: {e.response.status_code}")
                    print(f"   Response headers: {dict(e.response.headers)}")
                    print(f"   Response text: {e.response.text[:500]}...")
                except Exception:
                    # Narrowed from a bare `except:` so KeyboardInterrupt /
                    # SystemExit are not swallowed here.
                    print("   Could not extract response details")

            # Log the full exception
            if self.debug:
                import traceback
                print(f"   Full traceback: {traceback.format_exc()}")

            raise

    def get_model_info(self) -> Dict[str, str]:
        """Get information about the configured models."""
        return {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "provider": "kimi",
            "base_url": "https://api.moonshot.ai/v1"
        }

    async def health_check(self) -> bool:
        """
        Perform a simple health check to verify Kimi API connectivity.

        Returns:
            True if API is accessible, False otherwise
        """
        try:
            # Simple test call with Kimi's system prompt
            test_messages = [
                {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI. You are proficient in Chinese and English conversations. You provide users with safe, helpful, and accurate answers. You will reject any questions involving terrorism, racism, or explicit content. Moonshot AI is a proper noun and should not be translated."},
                {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"}
            ]

            result, _ = await self._call_api(
                model=self.solver_model,
                messages=test_messages,
                temperature=0.0
            )

            if result and "ok" in result.lower():
                if self.debug:
                    print(f"✅ Kimi API health check passed for {self.solver_model}")
                return True
            else:
                if self.debug:
                    print(f"⚠️ Kimi API health check returned unexpected response")
                return False

        except Exception as e:
            if self.debug:
                print(f"❌ Kimi API health check failed: {str(e)}")
            return False

    async def estimate_cost(self,
                            num_problems: int,
                            avg_problem_length: int = 1000,
                            avg_solution_length: int = 2000) -> Dict[str, float]:
        """
        Estimate the cost for processing a given number of problems with Kimi models.

        Args:
            num_problems: Number of problems to process
            avg_problem_length: Average length of problem statements in characters
            avg_solution_length: Average length of solutions in characters

        Returns:
            Dictionary with cost estimates.
            NOTE(review): the keys here ("solver_cost"/"grader_cost") differ
            from the parent class's ("solve_cost"/"grade_cost"); kept as-is
            because callers may rely on them — confirm before unifying.
        """
        # Rough token estimates (1 token ≈ 4 characters for English)
        tokens_per_solve = (avg_problem_length + avg_solution_length) // 4
        tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4

        # Kimi pricing (in USD per 1K tokens)
        # These are example prices - update with actual Kimi pricing
        pricing = {
            "moonshot-v1-8k": {"input": 0.012, "output": 0.012},
            "moonshot-v1-32k": {"input": 0.024, "output": 0.024},
            "moonshot-v1-128k": {"input": 0.06, "output": 0.06},
        }

        def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float:
            if model not in pricing:
                model = "moonshot-v1-8k"  # Default to 8k pricing

            input_cost = (input_tokens / 1000) * pricing[model]["input"]
            output_cost = (output_tokens / 1000) * pricing[model]["output"]
            return input_cost + output_cost

        solve_cost = get_model_cost(
            self.solver_model,
            tokens_per_solve * num_problems,
            tokens_per_solve * num_problems // 2  # Assume output is ~50% of input
        )

        grade_cost = get_model_cost(
            self.grader_model,
            tokens_per_grade * num_problems,
            tokens_per_grade * num_problems // 4  # Grading output is shorter
        )

        return {
            "solver_cost": solve_cost,
            "grader_cost": grade_cost,
            "total_cost": solve_cost + grade_cost,
            "num_problems": num_problems,
            "solver_model": self.solver_model,
            "grader_model": self.grader_model
        }
class OpenRouterModelLoader(OpenAIModelLoader):
    """OpenRouter implementation using OpenAI-compatible API."""

    def __init__(self,
                 solver_model: str = "openai/gpt-4o",
                 grader_model: str = "openai/gpt-4o",
                 api_key: Optional[str] = None,
                 site_url: Optional[str] = None,
                 site_name: Optional[str] = None,
                 **kwargs):
        """
        Initialize OpenRouter model loader.

        Args:
            solver_model: Model for solving problems, in "provider/model-name"
                format (e.g. "openai/gpt-4o", "anthropic/claude-3-opus").
            grader_model: Model for grading solutions, same format.
            api_key: OpenRouter API key; falls back to the
                OPENROUTER_API_KEY environment variable when omitted.
            site_url: Optional site URL for rankings on openrouter.ai.
            site_name: Optional site name for rankings on openrouter.ai.
            **kwargs: Forwarded to the OpenAI-compatible parent class.

        Raises:
            ValueError: If no API key is supplied and none is in the env.
        """
        # Resolve the key up front so a misconfiguration fails fast.
        resolved_key = api_key if api_key is not None else os.getenv('OPENROUTER_API_KEY')
        if not resolved_key:
            raise ValueError("OpenRouter API key not provided. Set OPENROUTER_API_KEY environment variable or pass api_key parameter")

        # Ranking metadata is sent as extra HTTP headers on each call.
        self.site_url = site_url
        self.site_name = site_name

        # Delegate to the OpenAI loader, pointed at OpenRouter's endpoint.
        super().__init__(
            solver_model=solver_model,
            grader_model=grader_model,
            api_key=resolved_key,
            base_url="https://openrouter.ai/api/v1",
            **kwargs
        )

    async def _call_api(self,
                        model: str,
                        messages: List[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Make an API call to OpenRouter with proper headers.

        Args:
            model: Model name in "provider/model-name" format.
            messages: Chat messages in OpenAI format.
            temperature: Sampling temperature.

        Returns:
            Tuple of (response_content, raw_response).

        Raises:
            ValueError: On an empty response or empty message content.
            Exception: Any API error is logged and re-raised for retry logic.
        """
        try:
            # OpenRouter attributes traffic via these optional headers.
            ranking_headers = {}
            if self.site_url:
                ranking_headers["HTTP-Referer"] = self.site_url
            if self.site_name:
                ranking_headers["X-Title"] = self.site_name

            from .prompts import RESPONSE_FORMAT

            request_kwargs = {
                "model": model,
                "messages": messages,
                "temperature": temperature,
                # Set max_tokens to avoid truncation, especially for models like Gemini
                # 32000 is a reasonable default that works for most models
                "max_tokens": 32000,
                # response_format is sent for every model — OpenRouter
                # handles provider compatibility itself.
                "response_format": RESPONSE_FORMAT,
            }
            if ranking_headers:
                request_kwargs["extra_headers"] = ranking_headers

            completion = await self.client.chat.completions.create(**request_kwargs)

            # Validate the payload before touching its fields.
            if not completion or not completion.choices or len(completion.choices) == 0:
                raise ValueError("Empty response from OpenRouter API")

            text = completion.choices[0].message.content
            if not text:
                raise ValueError("Empty content in OpenRouter API response")

            return text, text

        except Exception as exc:
            # Rebrand upstream messages so logs point at the right service.
            message = str(exc)
            if "OpenAI API Error" in message:
                message = message.replace("OpenAI API Error", "OpenRouter API Error")

            exc_name = type(exc).__name__
            if "RateLimitError" in exc_name:
                print(f"🚫 OpenRouter RateLimitError: {message}")
            elif "APIError" in exc_name or "APIConnectionError" in exc_name:
                print(f"❌ OpenRouter API Error: {message}")
            else:
                print(f"❌ Unexpected error in OpenRouter API call: {message}")
            raise

    def get_model_info(self) -> Dict[str, str]:
        """Get information about the configured models."""
        info = {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "provider": "openrouter",
            "base_url": "https://openrouter.ai/api/v1",
        }
        return info

    async def health_check(self) -> bool:
        """
        Perform a simple health check to verify OpenRouter API connectivity.

        Returns:
            True if API is accessible, False otherwise.
        """
        probe = [
            {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"}
        ]
        try:
            reply, _ = await self._call_api(
                self.solver_model,
                probe,
                temperature=0.0
            )
        except Exception as e:
            print(f"❌ OpenRouter health check failed: {e}")
            return False
        return reply is not None

    @staticmethod
    def get_available_models() -> List[str]:
        """
        Get a list of commonly available models on OpenRouter.

        Note: This is not exhaustive. Check https://openrouter.ai/models
        for the full list.

        Returns:
            List of model identifiers in "provider/model-name" format.
        """
        return [
            # OpenAI models
            "openai/gpt-4o",
            "openai/gpt-4o-mini",
            "openai/gpt-4-turbo",
            "openai/gpt-3.5-turbo",
            "openai/o1-preview",
            "openai/o1-mini",

            # Anthropic models
            "anthropic/claude-3-opus",
            "anthropic/claude-3-sonnet",
            "anthropic/claude-3-haiku",
            "anthropic/claude-2.1",
            "anthropic/claude-2",

            # Google models
            "google/gemini-pro",
            "google/gemini-pro-vision",
            "google/palm-2-codechat-bison",
            "google/palm-2-chat-bison",

            # Meta models
            "meta-llama/llama-3-70b-instruct",
            "meta-llama/llama-3-8b-instruct",
            "meta-llama/codellama-70b-instruct",

            # Mistral models
            "mistralai/mistral-large",
            "mistralai/mistral-medium",
            "mistralai/mistral-small",
            "mistralai/mistral-7b-instruct",
            "mistralai/mixtral-8x7b-instruct",

            # Other notable models
            "cohere/command-r-plus",
            "cohere/command-r",
            "databricks/dbrx-instruct",
            "deepseek/deepseek-coder",
            "deepseek/deepseek-chat",
            "qwen/qwen-2-72b-instruct",
            "qwen/qwen-1.5-110b-chat",
        ]
"""
Prompt templates for mathematical problem solving and grading.
These prompts have been refined and validated through extensive testing.
"""

# ---------------------------------------------------------------------------
# Solver prompts (sent to the solver model, e.g. 4o-mini)
# ---------------------------------------------------------------------------

SOLVER_SYSTEM_PROMPT = """You are an expert mathematician solving competition-level problems.
Provide detailed, step-by-step solutions with clear mathematical reasoning.

Requirements:
- Show all your work and intermediate steps
- Justify each major step of your reasoning
- Use proper mathematical notation
- Be thorough but concise
- State your final answer clearly

Solve the problem completely and rigorously."""

# User-turn template; fill with .format(problem_statement=...). Doubled braces
# escape the literal JSON braces in the requested response schema.
SOLVER_USER_TEMPLATE = """Please solve this mathematical problem:

{problem_statement}

Provide a complete solution with detailed reasoning. Return your response in JSON format:
{{"solution": "your complete step-by-step solution with mathematical reasoning",
 "final_answer": "your final answer in a clear, concise form"}}"""

# ---------------------------------------------------------------------------
# Grader prompts (sent to the grader model, e.g. o3)
# ---------------------------------------------------------------------------

# Proof problems are graded with maximum strictness: any gap fails the proof.
PROOF_GRADER_SYSTEM_PROMPT = """You are an extremely strict mathematical grader evaluating competition-level PROOF problems.

GRADING STANDARDS (BE VERY STRICT):
- Mathematical rigor: Every step must be mathematically sound and justified
- Logical flow: The reasoning must be clear, complete, and logically connected
- Correctness: All calculations, algebraic manipulations, and conclusions must be correct
- Completeness: The solution must address all parts of the problem fully
- Precision: Mathematical statements must be precise and unambiguous

FAILING CRITERIA (Mark as INCORRECT if ANY of these apply):
- Any unjustified logical leap or gap in reasoning
- Any computational error, no matter how small
- Missing steps in critical parts of the argument
- Imprecise or ambiguous mathematical statements
- Incorrect final answer, even if approach is partially correct
- Circular reasoning or logical fallacies
- Misuse of mathematical theorems or definitions

BE EXTREMELY STRICT. Competition mathematics proofs require perfect precision."""

# Calculation problems are graded leniently: the final answer dominates.
CALCULATION_GRADER_SYSTEM_PROMPT = """You are a mathematical grader evaluating competition-level CALCULATION problems.

GRADING STANDARDS FOR CALCULATION PROBLEMS:
- Primary focus: Is the final answer correct?
- Secondary focus: Is the overall approach reasonable and mathematically sound?
- Computation: Allow minor computational slips if the method is correct and final answer is right

GRADING CRITERIA:
- CORRECT: Final answer is correct AND approach is fundamentally sound
- INCORRECT: Final answer is wrong OR approach is fundamentally flawed

For calculation problems, the final numerical answer is the most important criterion.
Minor intermediate errors are acceptable if they don't affect the final result."""

# User-turn grading templates; fill with .format(problem_statement=...,
# solution=..., reference_solution=...). Both request the same JSON schema so
# downstream parsing is uniform across problem types.
PROOF_GRADER_USER_TEMPLATE = """Grade this PROOF solution with extreme strictness.

PROBLEM:
{problem_statement}

STUDENT SOLUTION:
{solution}

CORRECT REFERENCE SOLUTION:
{reference_solution}

Evaluate with maximum strictness. Every logical step must be perfect. Return JSON with:
{{"grade": "CORRECT" or "INCORRECT",
 "detailed_feedback": "specific detailed analysis of what is right/wrong",
 "major_issues": "list of significant mathematical errors or gaps",
 "final_answer_correct": true or false,
 "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed),
 "overall_assessment": "comprehensive evaluation summary"}}"""

CALCULATION_GRADER_USER_TEMPLATE = """Grade this CALCULATION solution with focus on final answer correctness.

PROBLEM:
{problem_statement}

STUDENT SOLUTION:
{solution}

CORRECT REFERENCE SOLUTION:
{reference_solution}

Focus primarily on whether the final answer is correct. Return JSON with:
{{"grade": "CORRECT" or "INCORRECT",
 "detailed_feedback": "specific detailed analysis of what is right/wrong",
 "major_issues": "list of significant mathematical errors or gaps",
 "final_answer_correct": true or false,
 "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed),
 "overall_assessment": "comprehensive evaluation summary"}}"""

# ---------------------------------------------------------------------------
# API call configuration shared by the loaders
# ---------------------------------------------------------------------------

# Ask OpenAI-compatible endpoints for structured JSON output.
RESPONSE_FORMAT = {"type": "json_object"}

DEFAULT_RETRIES = 6  # Limited to 6 retries before marking as failed
DEFAULT_TIMEOUT_BASE = 600
"""
VLLM direct Python API model loader implementation.
Uses VLLM's Python API directly without requiring a separate server process.
"""

import asyncio
import json
import re
from typing import Dict, List, Tuple, Optional, Any
import torch

try:
    from vllm import LLM, SamplingParams
    VLLM_AVAILABLE = True
except ImportError:
    LLM = None
    SamplingParams = None
    VLLM_AVAILABLE = False

from .base import ModelLoader
from .prompts import SOLVER_SYSTEM_PROMPT, PROOF_GRADER_SYSTEM_PROMPT


class VLLMDirectModelLoader(ModelLoader):
    """VLLM direct Python API implementation of the ModelLoader."""

    def __init__(self,
                 solver_model: str = "gpt2",
                 grader_model: str = "gpt2",
                 max_model_len: int = 512,
                 gpu_memory_utilization: float = 0.4,
                 device: str = "auto",
                 **kwargs):
        """
        Initialize VLLM direct model loader.

        Args:
            solver_model: Model name for solving problems (default: gpt2)
            grader_model: Model name for grading solutions (default: gpt2)
            max_model_len: Maximum sequence length (default: 512 for testing)
            gpu_memory_utilization: GPU memory utilization ratio (default: 0.4)
            device: Device to use ('auto', 'cuda', 'cpu')
            **kwargs: Additional arguments passed to parent class

        Raises:
            ImportError: If the vllm package is not installed.
        """
        if not VLLM_AVAILABLE:
            raise ImportError(
                "vllm package is required for VLLMDirectModelLoader. "
                "Install with: pip install vllm"
            )

        super().__init__(solver_model, grader_model, **kwargs)

        self.max_model_len = max_model_len
        self.gpu_memory_utilization = gpu_memory_utilization
        self.device = device

        # Model instances (lazy loaded on first use)
        self._solver_llm = None
        self._grader_llm = None
        self._loaded_models = []  # names of models loaded so far

        print(f"🔧 VLLM Direct loader initialized")
        print(f"   Device: {device}")
        print(f"   Max length: {max_model_len}")
        print(f"   GPU utilization: {gpu_memory_utilization}")

    def _get_vllm_config(self, model: str) -> Dict[str, Any]:
        """Get VLLM engine keyword arguments for a model."""
        return {
            "model": model,
            "max_model_len": self.max_model_len,
            "gpu_memory_utilization": self.gpu_memory_utilization,
            "trust_remote_code": False,
            "enforce_eager": True,  # Disable graph optimization for faster startup
        }

    async def _load_model(self, model: str, purpose: str) -> LLM:
        """Load a VLLM model instance and record it in _loaded_models.

        Args:
            model: Model name/path to load.
            purpose: Human-readable role label ("solver"/"grader") for logging.
        """
        print(f"📥 Loading {purpose} model: {model}")

        try:
            config = self._get_vllm_config(model)
            llm = LLM(**config)

            self._loaded_models.append(model)
            print(f"✅ Model loaded successfully: {model}")
            return llm

        except Exception as e:
            print(f"❌ Failed to load model {model}: {e}")
            raise

    async def _get_solver_model(self) -> LLM:
        """Get or lazily load the solver model."""
        if self._solver_llm is None:
            self._solver_llm = await self._load_model(self.solver_model, "solver")
        return self._solver_llm

    async def _get_grader_model(self) -> LLM:
        """Get or lazily load the grader model (reusing the solver instance
        when both roles are configured with the same model name)."""
        if self._grader_llm is None:
            if self.solver_model == self.grader_model and self._solver_llm is not None:
                print(f"♻️ Reusing solver model for grading: {self.grader_model}")
                self._grader_llm = self._solver_llm
            else:
                self._grader_llm = await self._load_model(self.grader_model, "grader")
        return self._grader_llm

    def _format_messages_as_prompt(self, messages: List[Dict[str, str]]) -> str:
        """Convert chat messages to a single prompt string.

        Unknown roles are silently dropped. A trailing "Assistant:" cue is
        appended unless the last message is already from the assistant.
        """
        prompt_parts = []

        for message in messages:
            role = message["role"]
            content = message["content"]

            if role == "system":
                prompt_parts.append(f"System: {content}")
            elif role == "user":
                prompt_parts.append(f"User: {content}")
            elif role == "assistant":
                prompt_parts.append(f"Assistant: {content}")

        # Add final assistant prompt. The empty-messages guard fixes an
        # IndexError the original raised on messages[-1] for an empty list.
        if not messages or messages[-1]["role"] != "assistant":
            prompt_parts.append("Assistant:")

        return "\n\n".join(prompt_parts)

    def _extract_json_from_response(self, response: str) -> Optional[Dict]:
        """Extract a JSON object from a raw model response.

        Tries the first brace-delimited span, then the whole (stripped)
        response; returns None when neither parses.
        """
        try:
            # Try to find JSON in the response
            json_match = re.search(r'\{.*\}', response, re.DOTALL)
            if json_match:
                json_str = json_match.group()
                return json.loads(json_str)

            # If no JSON found, try to parse the entire response
            return json.loads(response.strip())

        except json.JSONDecodeError:
            # If JSON parsing fails, return None
            return None

    async def _call_api(self,
                        model: str,
                        messages: List[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Make an inference call using VLLM.

        Args:
            model: Model name to use (must be the configured solver or grader)
            messages: List of messages in chat format
            temperature: Temperature for generation

        Returns:
            Tuple of (response_content, raw_response)

        Raises:
            ValueError: If `model` is neither the solver nor the grader model.
        """
        try:
            # Get the appropriate model instance
            if model == self.solver_model:
                llm = await self._get_solver_model()
            elif model == self.grader_model:
                llm = await self._get_grader_model()
            else:
                raise ValueError(f"Unknown model: {model}")

            # Convert messages to prompt
            prompt = self._format_messages_as_prompt(messages)

            # Set up sampling parameters
            sampling_params = SamplingParams(
                temperature=temperature,
                top_p=0.95,
                max_tokens=500,  # Reasonable limit for responses
                stop=["\nUser:", "\nSystem:"]  # Stop at new conversation turns
            )

            # Generate response
            outputs = llm.generate([prompt], sampling_params)

            if outputs:  # truthiness covers the len(outputs) > 0 check
                generated_text = outputs[0].outputs[0].text
                return generated_text.strip(), generated_text
            else:
                return None, ""

        except Exception as e:
            print(f"❌ VLLM inference error: {str(e)}")
            raise

    def get_model_info(self) -> Dict[str, Any]:
        """Get information about the configured models.

        Annotated Dict[str, Any] (was Dict[str, str]): "loaded_models" is a
        list of model names, not a string.
        """
        return {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "provider": "vllm_direct",
            "device": self.device,
            "loaded_models": self._loaded_models
        }

    async def health_check(self) -> bool:
        """
        Perform a simple health check to verify VLLM functionality.

        Returns:
            True if models can be loaded and generate text, False otherwise
        """
        try:
            print(f"🔍 VLLM health check starting...")

            # Try to load and use the solver model
            test_messages = [
                {"role": "user", "content": "Hello! Please respond with 'Health check OK'."}
            ]

            result, _ = await self._call_api(
                model=self.solver_model,
                messages=test_messages,
                temperature=0.1
            )

            if result and len(result) > 0:
                print(f"✅ VLLM health check passed for {self.solver_model}")
                print(f"   Response: {result[:50]}...")
                return True
            else:
                print(f"❌ VLLM health check failed: empty response")
                return False

        except Exception as e:
            print(f"❌ VLLM health check failed: {str(e)}")
            return False

    async def estimate_cost(self,
                            num_problems: int,
                            avg_problem_length: int = 1000,
                            avg_solution_length: int = 2000) -> Dict[str, float]:
        """
        Estimate the cost for processing a given number of problems.
        For direct VLLM, cost is computational (time/energy).

        Args:
            num_problems: Number of problems to process
            avg_problem_length: Average length of problem statements in characters
            avg_solution_length: Average length of solutions in characters

        Returns:
            Dictionary with cost estimates
        """
        # Token estimates (1 token ≈ 4 characters)
        tokens_per_solve = (avg_problem_length + avg_solution_length) // 4
        tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4

        # Model size cost factors (based on parameter count)
        model_costs = {
            "gpt2": 1.0,             # 124M params
            "distilgpt2": 0.5,       # 82M params
            "microsoft/dialo": 1.2,  # DialoGPT variants
            "tinyllama": 2.0,        # 1.1B params
        }

        def get_model_cost(model: str) -> float:
            # Substring match against known model families.
            model_lower = model.lower()
            for key, cost in model_costs.items():
                if key in model_lower:
                    return cost
            return 1.5  # Default cost

        solver_cost_factor = get_model_cost(self.solver_model)
        grader_cost_factor = get_model_cost(self.grader_model)

        # Computational cost estimation (arbitrary units)
        solve_cost = tokens_per_solve * num_problems * solver_cost_factor / 10000
        grade_cost = tokens_per_grade * num_problems * grader_cost_factor / 10000

        total_cost = solve_cost + grade_cost

        return {
            "solve_cost": round(solve_cost, 4),
            "grade_cost": round(grade_cost, 4),
            "total_cost": round(total_cost, 4),
            "cost_per_problem": round(total_cost / num_problems, 6),
            "currency": "computational_units",
            "note": "Direct VLLM costs are computational (GPU time/energy)"
        }

    async def unload_all_models(self):
        """Unload all loaded models to free GPU memory.

        Fix: the aliasing check is taken BEFORE the solver reference is
        cleared. The original compared `self._grader_llm != self._solver_llm`
        after setting `_solver_llm = None`, so the shared-instance branch
        could never work as intended (and `!=` may invoke the model's
        __eq__); identity (`is`) is the correct comparison here.
        """
        try:
            print("🗑️ Unloading VLLM models...")

            # Detect solver/grader sharing one instance before mutating state.
            shared_instance = self._grader_llm is self._solver_llm

            # Clean up model instances
            if self._solver_llm is not None:
                del self._solver_llm
                self._solver_llm = None

            if self._grader_llm is not None and not shared_instance:
                del self._grader_llm
            self._grader_llm = None

            # Clear CUDA cache
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

            self._loaded_models.clear()
            print("✅ Models unloaded successfully")

        except Exception as e:
            print(f"⚠️ Error during model cleanup: {e}")
"""
VLLM local model loader implementation.
Handles API calls to locally deployed VLLM services with OpenAI-compatible endpoints.
"""

import asyncio
import random
from typing import Dict, List, Tuple, Optional

try:
    from openai import AsyncOpenAI, RateLimitError, APIError, APIConnectionError
except ImportError:
    AsyncOpenAI = None
    RateLimitError = Exception
    APIError = Exception
    APIConnectionError = Exception

from .base import ModelLoader
from .prompts import RESPONSE_FORMAT


class VLLMModelLoader(ModelLoader):
    """VLLM local model implementation of the ModelLoader."""

    def __init__(self,
                 solver_model: str = "meta-llama/Llama-3.2-3B-Instruct",
                 grader_model: str = "meta-llama/Llama-3.2-8B-Instruct",
                 base_url: str = "http://localhost:8000/v1",
                 api_key: str = "EMPTY",
                 **kwargs):
        """
        Initialize VLLM model loader.

        Args:
            solver_model: Model name for solving problems (default: Llama-3.2-3B-Instruct)
            grader_model: Model name for grading solutions (default: Llama-3.2-8B-Instruct)
            base_url: VLLM server URL (default: http://localhost:8000/v1)
            api_key: API key for VLLM server (default: "EMPTY" for local)
            **kwargs: Additional arguments passed to parent class

        Raises:
            ImportError: If the openai package is not installed.
        """
        if AsyncOpenAI is None:
            raise ImportError(
                "openai package is required for VLLMModelLoader. "
                "Install with: pip install openai"
            )

        super().__init__(solver_model, grader_model, **kwargs)

        # Initialize OpenAI-compatible client for VLLM
        self.client = AsyncOpenAI(
            base_url=base_url,
            api_key=api_key
        )
        self.base_url = base_url

    async def _call_api(self,
                        model: str,
                        messages: List[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Make an API call to VLLM server.

        Args:
            model: Model name to use
            messages: List of messages in chat format
            temperature: Temperature for generation

        Returns:
            Tuple of (response_content, raw_response)

        Raises:
            RateLimitError/APIError/APIConnectionError: propagated (after a
                short jittered sleep on rate limits) to trigger retry logic.
        """
        try:
            # Prepare API call parameters
            api_params = {
                "model": model,
                "messages": messages,
                "temperature": temperature,
                "max_tokens": 4000,
            }

            # Request structured JSON output on deterministic calls only.
            # NOTE: the original wrapped this dict assignment in a bare
            # try/except — a plain assignment cannot raise, so the guard was
            # dead code. Any lack of server-side response_format support
            # surfaces from the API call below instead.
            if temperature == 0.0:
                api_params["response_format"] = RESPONSE_FORMAT

            # Make the API call
            response = await self.client.chat.completions.create(**api_params)

            # Extract response content
            content = response.choices[0].message.content or ""

            return content, content

        except (RateLimitError, APIError, APIConnectionError) as e:
            # Handle various API errors
            error_str = str(e)
            print(f"❌ VLLM API Error: {error_str}")

            if "rate" in error_str.lower() or "limit" in error_str.lower():
                # Brief jittered backoff before the caller's retry kicks in.
                sleep_time = 2 + random.random()
                print(f"   ⏰ Rate limited, sleeping {sleep_time:.1f}s")
                await asyncio.sleep(sleep_time)

            # Re-raise to trigger retry logic
            raise

        except Exception as e:
            print(f"❌ Unexpected error in VLLM API call: {str(e)}")
            raise

    def get_model_info(self) -> Dict[str, str]:
        """Get information about the configured models."""
        return {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "provider": "vllm",
            "base_url": self.base_url
        }

    async def health_check(self) -> bool:
        """
        Perform a simple health check to verify VLLM server connectivity.

        Returns:
            True if server is accessible, False otherwise
        """
        try:
            # Simple test call
            test_messages = [
                {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"}
            ]

            result, _ = await self._call_api(
                model=self.solver_model,
                messages=test_messages,
                temperature=0.0
            )

            if result and ("ok" in result.lower() or "hello" in result.lower()):
                print(f"✅ VLLM API health check passed for {self.solver_model}")
                return True
            else:
                print(f"⚠️ VLLM API health check returned unexpected response")
                return False

        except Exception as e:
            print(f"❌ VLLM API health check failed: {str(e)}")
            print(f"   Make sure VLLM server is running at {self.base_url}")
            return False

    async def estimate_cost(self,
                            num_problems: int,
                            avg_problem_length: int = 1000,
                            avg_solution_length: int = 2000) -> Dict[str, float]:
        """
        Estimate the cost for processing a given number of problems.
        For local VLLM, cost is typically computational (time/energy) rather than monetary.

        Args:
            num_problems: Number of problems to process
            avg_problem_length: Average length of problem statements in characters
            avg_solution_length: Average length of solutions in characters

        Returns:
            Dictionary with cost estimates (computational cost in arbitrary units)
        """
        # Rough token estimates (1 token ≈ 4 characters for English)
        tokens_per_solve = (avg_problem_length + avg_solution_length) // 4
        tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4

        # Computational cost estimation (arbitrary units based on model size)
        # Larger models consume more computational resources
        model_costs = {
            "llama-3.2-1b": 1.0,
            "llama-3.2-3b": 2.0,
            "llama-3.2-8b": 4.0,
            "llama-3.1-8b": 4.0,
            "llama-3.1-70b": 20.0,
            "mistral-7b": 3.0,
            "qwen2.5-7b": 3.0,
        }

        def get_model_cost(model: str) -> float:
            # Substring match against known model families.
            model_lower = model.lower()
            for key, cost in model_costs.items():
                if key in model_lower:
                    return cost
            return 3.0  # Default cost for unknown models

        # Calculate computational costs
        solver_cost_factor = get_model_cost(self.solver_model)
        grader_cost_factor = get_model_cost(self.grader_model)

        solve_cost = tokens_per_solve * num_problems * solver_cost_factor / 1000
        grade_cost = tokens_per_grade * num_problems * grader_cost_factor / 1000

        total_cost = solve_cost + grade_cost

        return {
            "solve_cost": round(solve_cost, 4),
            "grade_cost": round(grade_cost, 4),
            "total_cost": round(total_cost, 4),
            "cost_per_problem": round(total_cost / num_problems, 6),
            "currency": "computational_units",
            "note": "Local VLLM costs are computational (time/energy) rather than monetary"
        }

    async def list_models(self) -> List[str]:
        """
        List available models on the VLLM server.

        Returns:
            List of available model names (falls back to the configured
            solver/grader names when the server cannot be queried)
        """
        try:
            # Try to get models list from VLLM server
            models_response = await self.client.models.list()
            return [model.id for model in models_response.data]
        except Exception as e:
            print(f"⚠️ Could not retrieve models list: {str(e)}")
            return [self.solver_model, self.grader_model]


# --------------------------------------------------------------------------
# file: putnam-bench-anon/loader/xai_client.py
# --------------------------------------------------------------------------

"""
xAI model loader implementation.
Handles API calls to xAI Grok models using OpenAI-compatible interface.
"""

import os
from typing import Dict, Optional, List, Tuple

from .openai_client import OpenAIModelLoader


class XAIModelLoader(OpenAIModelLoader):
    """xAI implementation using OpenAI-compatible API."""

    def __init__(self,
                 solver_model: str = "grok-3",
                 grader_model: str = "grok-3",
                 api_key: Optional[str] = None,
                 **kwargs):
        """
        Initialize xAI model loader.

        Args:
            solver_model: xAI model for solving problems (default: grok-3)
            grader_model: xAI model for grading solutions (default: grok-3)
            api_key: xAI API key (if None, uses XAI_API_KEY environment variable)
            **kwargs: Additional arguments passed to parent class
        """
        # Get API key from parameter or environment
        if api_key is None:
            api_key = os.getenv('XAI_API_KEY')

        # Initialize with xAI-specific settings
        super().__init__(
            solver_model=solver_model,
            grader_model=grader_model,
            api_key=api_key,
            base_url="https://api.x.ai/v1",
            **kwargs
        )

    async def _call_api(self,
                        model: str,
                        messages: List[Dict[str, str]],
                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
        """
        Make an API call to xAI with proper error handling.

        Args:
            model: xAI model name
            messages: List of messages in chat format
            temperature: Temperature for generation

        Returns:
            Tuple of (response_content, raw_response)

        Raises:
            Exception: re-raises anything from the parent call after
                rebranding the log output from "OpenAI" to "xAI".
        """
        try:
            # Call parent's implementation
            return await super()._call_api(model, messages, temperature)

        except Exception as e:
            # Replace "OpenAI" with "xAI" in error messages
            error_msg = str(e)
            if "OpenAI API Error" in error_msg:
                error_msg = error_msg.replace("OpenAI API Error", "xAI API Error")

            # Log with xAI-specific prefix. NOTE(review): classification by
            # class-name substring is fragile — confirm against the actual
            # exception hierarchy of the installed openai SDK.
            if "RateLimitError" in type(e).__name__:
                print(f"🚫 xAI RateLimitError: {error_msg}")
                raise
            elif "APIError" in type(e).__name__ or "APIConnectionError" in type(e).__name__:
                print(f"❌ xAI API Error: {error_msg}")
                raise
            else:
                print(f"❌ Unexpected error in xAI API call: {error_msg}")
                raise

    def get_model_info(self) -> Dict[str, str]:
        """Get information about the configured models."""
        return {
            "solver_model": self.solver_model,
            "grader_model": self.grader_model,
            "provider": "xai",
            "base_url": "https://api.x.ai/v1"
        }

    async def health_check(self) -> bool:
        """
        Perform a simple health check to verify xAI API connectivity.

        Returns:
            True if API is accessible, False otherwise
        """
        try:
            # Simple test call
            test_messages = [
                {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"}
            ]

            result, _ = await self._call_api(
                model=self.solver_model,
                messages=test_messages,
                temperature=0.0
            )

            if result and "ok" in result.lower():
                print(f"✅ xAI API health check passed for {self.solver_model}")
                return True
            else:
                print(f"⚠️ xAI API health check returned unexpected response")
                return False

        except Exception as e:
            print(f"❌ xAI API health check failed: {str(e)}")
            return False

    async def estimate_cost(self,
                            num_problems: int,
                            avg_problem_length: int = 1000,
                            avg_solution_length: int = 2000) -> Dict[str, float]:
        """
        Estimate the cost for processing a given number of problems with xAI models.

        Args:
            num_problems: Number of problems to process
            avg_problem_length: Average length of problem statements in characters
            avg_solution_length: Average length of solutions in characters

        Returns:
            Dictionary with cost estimates
        """
        # Rough token estimates (1 token ≈ 4 characters for English)
        tokens_per_solve = (avg_problem_length + avg_solution_length) // 4
        tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4

        # xAI pricing (update with actual pricing when available)
        # These are estimates based on similar model pricing
        pricing = {
            "grok-3": {"input": 0.01, "output": 0.03},    # per 1K tokens (estimated)
            "grok-2": {"input": 0.005, "output": 0.015},  # per 1K tokens (estimated)
        }

        def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float:
            if model not in pricing:
                model = "grok-3"  # Default to grok-3 pricing

            input_cost = (input_tokens / 1000) * pricing[model]["input"]
            output_cost = (output_tokens / 1000) * pricing[model]["output"]
            return input_cost + output_cost

        # Calculate costs
        solve_cost = get_model_cost(
            self.solver_model,
            tokens_per_solve * num_problems,
            tokens_per_solve * num_problems // 2  # Assume output is ~50% of input
        )

        grade_cost = get_model_cost(
            self.grader_model,
            tokens_per_grade * num_problems,
            tokens_per_grade * num_problems // 3  # Assume output is ~33% of input
        )

        total_cost = solve_cost + grade_cost

        return {
            "solve_cost": round(solve_cost, 4),
            "grade_cost": round(grade_cost, 4),
            "total_cost": round(total_cost, 4),
            "cost_per_problem": round(total_cost / num_problems, 6),
            "currency": "USD",
            "note": "xAI pricing estimates - update with actual pricing"
        }
\ No newline at end of file diff --git a/putnam-bench-anon/putnam_cli.py b/putnam-bench-anon/putnam_cli.py new file mode 100644 index 0000000..59ca5d3 --- /dev/null +++ b/putnam-bench-anon/putnam_cli.py @@ -0,0 +1,813 @@ +#!/usr/bin/env python3 +""" +Putnam CLI - Simple command-line interface for mathematical problem solving. + +This CLI provides easy-to-use commands for testing problems, checking health, +running benchmarks, and managing the system. + +Usage: + putnam solve problem.json # Solve a single problem + putnam test --provider openai # Quick test + putnam health # Check all providers + putnam benchmark --quick # Quick benchmark + putnam batch dataset/ --provider anthropic # Batch evaluation + +Cross-provider usage: + putnam solve problem.json --solver-provider kimi --grader-provider openai + putnam batch dataset/ --solver-provider kimi --grader-provider openai +""" + +import asyncio +import json +import sys +from pathlib import Path +import argparse +from typing import Dict, Any, Optional +import os + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import create_loader, create_cross_provider_loader, get_supported_providers, get_default_models + + +class PutnamCLI: + """Main CLI class for Putnam problem solver.""" + + def __init__(self): + self.verbose = False + + def print_banner(self): + """Print CLI banner.""" + print("🧮 Putnam Mathematical Problem Solver CLI") + print("=" * 50) + + def print_providers(self): + """Print available providers.""" + print("\n🤖 Available Providers:") + for provider in get_supported_providers(): + defaults = get_default_models(provider) + print(f" • {provider.upper()}") + print(f" Solver: {defaults['solver_model']}") + print(f" Grader: {defaults['grader_model']}") + print() + + def _create_loader(self, args, loader_kwargs: Optional[Dict] = None) -> Any: + """ + Create a loader based on command-line arguments. + Handles both single-provider and cross-provider scenarios. 
+ + Args: + args: Command-line arguments + loader_kwargs: Additional kwargs for loader creation + + Returns: + ModelLoader instance + """ + loader_kwargs = loader_kwargs or {} + + # Add debug flag if available + if hasattr(args, 'debug') and args.debug: + loader_kwargs['debug'] = True + + # Handle provider-specific settings + if hasattr(args, 'vllm_url') and args.vllm_url: + if args.provider == 'vllm' or (hasattr(args, 'solver_provider') and args.solver_provider == 'vllm'): + loader_kwargs['solver_kwargs'] = loader_kwargs.get('solver_kwargs', {}) + loader_kwargs['solver_kwargs']['base_url'] = args.vllm_url + if hasattr(args, 'grader_provider') and args.grader_provider == 'vllm': + loader_kwargs['grader_kwargs'] = loader_kwargs.get('grader_kwargs', {}) + loader_kwargs['grader_kwargs']['base_url'] = args.vllm_url + + if hasattr(args, 'device') and args.device: + if args.provider == 'huggingface' or (hasattr(args, 'solver_provider') and args.solver_provider == 'huggingface'): + loader_kwargs['solver_kwargs'] = loader_kwargs.get('solver_kwargs', {}) + loader_kwargs['solver_kwargs']['device'] = args.device + if hasattr(args, 'grader_provider') and args.grader_provider == 'huggingface': + loader_kwargs['grader_kwargs'] = loader_kwargs.get('grader_kwargs', {}) + loader_kwargs['grader_kwargs']['device'] = args.device + + # Check if we're using cross-provider mode + if hasattr(args, 'solver_provider') and args.solver_provider: + # Cross-provider mode + print(f"🚀 Using solver provider: {args.solver_provider}") + if hasattr(args, 'grader_provider') and args.grader_provider: + print(f"🎯 Using grader provider: {args.grader_provider}") + else: + print(f"🎯 Using grader provider: {args.solver_provider} (same as solver)") + + return create_cross_provider_loader( + solver_provider=args.solver_provider, + grader_provider=args.grader_provider if hasattr(args, 'grader_provider') else None, + solver_model=args.solver_model if hasattr(args, 'solver_model') else None, + 
grader_model=args.grader_model if hasattr(args, 'grader_model') else None, + **loader_kwargs + ) + else: + # Single provider mode (backward compatibility) + provider = args.provider if hasattr(args, 'provider') else "openai" + print(f"🚀 Using provider: {provider}") + + # Handle special cases for single provider + if provider == 'vllm' and hasattr(args, 'vllm_url'): + loader_kwargs['base_url'] = args.vllm_url + elif provider == 'huggingface' and hasattr(args, 'device'): + loader_kwargs['device'] = args.device + + return create_loader( + provider, + solver_model=args.solver_model if hasattr(args, 'solver_model') else None, + grader_model=args.grader_model if hasattr(args, 'grader_model') else None, + **loader_kwargs + ) + + async def cmd_solve(self, args) -> int: + """Solve a single problem.""" + self.print_banner() + + # Setup logging + import logging + from datetime import datetime + from pathlib import Path + + # Create log file + log_dir = Path("solve_logs") + log_dir.mkdir(exist_ok=True) + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + log_file = log_dir / f"solve_debug_{timestamp}.log" + + # Setup logging + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler(log_file), + logging.StreamHandler() + ] + ) + logger = logging.getLogger(__name__) + + logger.info(f"🔍 Starting solve command, log file: {log_file}") + + # Load problem + try: + with open(args.problem_file, 'r', encoding='utf-8') as f: + problem_data = json.load(f) + logger.info(f"📁 Problem loaded from {args.problem_file}") + except Exception as e: + logger.error(f"❌ Error loading problem: {str(e)}") + return 1 + + # Setup provider + loader = self._create_loader(args) + logger.info(f"🤖 Created loader: solver={loader.solver_model}, grader={loader.grader_model}") + + # Health check + print("🔍 Checking provider health...") + if not await loader.health_check(): + logger.error("❌ Provider health check failed") + return 1 + + 
# Show problem + variant_type = args.variant or "original" + problem_stmt = problem_data.get(variant_type, {}).get('problem_statement', 'N/A') + logger.info(f"📝 Problem variant: {variant_type}") + logger.info(f"📄 Problem statement: {problem_stmt[:500]}...") + + print(f"\n📝 Problem ({variant_type}):") + print(f" {problem_stmt[:200]}{'...' if len(problem_stmt) > 200 else ''}") + + # Solve + print(f"\n⚡ Solving with {loader.solver_model}...") + logger.info(f"🔄 Starting solve process...") + + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type, + solver_model=args.solver_model, + grader_model=args.grader_model + ) + + # Log detailed results + logger.info("📊 DETAILED RESULTS:") + logger.info(f" Full result: {json.dumps(result, indent=2, ensure_ascii=False)}") + + # Analyze solve step + solve_data = result.get('solve', {}) + solve_status = solve_data.get('status', 'unknown') + logger.info(f"🔍 SOLVE ANALYSIS:") + logger.info(f" Status: {solve_status}") + + if solve_status == 'success': + solution = solve_data.get('solution', 'N/A') + logger.info(f" Solution length: {len(solution)} characters") + logger.info(f" Solution preview: {solution[:200]}...") + else: + error_msg = solve_data.get('error', 'No error message') + logger.error(f" Solve error: {error_msg}") + + # Analyze grade step + grade_data = result.get('grade', {}) + grade_status = grade_data.get('status', 'unknown') + logger.info(f"🔍 GRADE ANALYSIS:") + logger.info(f" Status: {grade_status}") + + if grade_status == 'success': + grade = grade_data.get('grade', 'N/A') + feedback = grade_data.get('detailed_feedback', 'N/A') + logger.info(f" Grade: {grade}") + logger.info(f" Feedback: {feedback}") + else: + error_msg = grade_data.get('error', 'No error message') + logger.error(f" Grade error: {error_msg}") + + # Show results + print(f"\n✅ Solution completed!") + + # Extract and display grade + if result.get('grade', {}).get('status') == 'success': + grade = result.get('grade', 
{}).get('grade', 'N/A') + is_correct = result.get('correct', False) + grade_display = f"{grade} ({'✓' if is_correct else '✗'})" + else: + grade_display = 'N/A (grading failed)' + + # Extract and display solution + if result.get('solve', {}).get('status') == 'success': + solution = result.get('solve', {}).get('solution', 'N/A') + else: + solution = 'N/A (solving failed)' + + print(f"🎯 Final Grade: {grade_display}") + print(f"🤖 Solution:") + print(f" {solution[:300]}{'...' if len(solution) > 300 else ''}") + + if args.verbose: + grading = result.get('grade', {}) + print(f"\n📊 Grading Details:") + print(f" Feedback: {grading.get('detailed_feedback', 'N/A')[:200]}...") + print(f" Major Issues: {grading.get('major_issues', 'N/A')}") + print(f" Rigor Score: {grading.get('reasoning_rigor_score', 'N/A')}") + + # Save detailed results + results_file = log_dir / f"solve_results_{timestamp}.json" + with open(results_file, 'w', encoding='utf-8') as f: + json.dump(result, f, indent=2, ensure_ascii=False) + logger.info(f"💾 Detailed results saved to {results_file}") + + # Save if requested + if args.output: + with open(args.output, 'w', encoding='utf-8') as f: + json.dump(result, f, indent=2, ensure_ascii=False) + print(f"💾 Results saved to {args.output}") + + print(f"\n📋 Log file created: {log_file}") + print(f"📋 Results file created: {results_file}") + + return 0 + + async def cmd_test(self, args) -> int: + """Quick test of a provider.""" + self.print_banner() + + # Create simple test problem + test_problem = { + 'question': 'Calculate 15 + 27.', + 'solution': 'The answer is 42.', + 'problem_type': 'calculation' + } + + try: + loader = self._create_loader(args) + + print("🔍 Health check...") + if not await loader.health_check(): + print("❌ Health check failed") + return 1 + + print("⚡ Running test problem...") + result = await loader.test_single_problem(test_problem, variant_type='original') + + print(f"✅ Test completed!") + + # Extract grade information + if 
result.get('grade', {}).get('status') == 'success': + grade = result.get('grade', {}).get('grade', 'N/A') + is_correct = result.get('correct', False) + grade_display = f"{grade} ({'✓' if is_correct else '✗'})" + else: + grade_display = 'N/A (grading failed)' + + # Extract solution + if result.get('solve', {}).get('status') == 'success': + solution = result.get('solve', {}).get('solution', 'N/A') + else: + solution = 'N/A (solving failed)' + + print(f"🎯 Grade: {grade_display}") + print(f"🤖 Solution: {solution[:100]}...") + + return 0 + + except Exception as e: + print(f"❌ Test failed: {str(e)}") + return 1 + + async def cmd_health(self, args) -> int: + """Check health of providers.""" + self.print_banner() + + print("🏥 Checking provider health...") + + # Import health check + try: + from scripts.health_check import HealthChecker + checker = HealthChecker(detailed=args.detailed) + + results = await checker.check_all_providers(args.provider) + + # Simple summary + summary = results['summary'] + print(f"\n📋 Summary: {summary['healthy_providers']}/{summary['total_providers']} providers healthy") + + return 0 if summary['healthy_providers'] > 0 else 1 + + except ImportError: + print("❌ Health check module not available") + return 1 + except Exception as e: + print(f"❌ Health check failed: {str(e)}") + return 1 + + async def cmd_benchmark(self, args) -> int: + """Run benchmark.""" + self.print_banner() + + print("🏁 Running benchmark...") + + try: + from scripts.benchmark import run_quick_test + await run_quick_test() + return 0 + + except ImportError: + print("❌ Benchmark module not available") + return 1 + except Exception as e: + print(f"❌ Benchmark failed: {str(e)}") + return 1 + + async def cmd_batch(self, args) -> int: + """Run batch evaluation.""" + self.print_banner() + + # Handle resume case - simplified version + if args.resume: + if not args.resume.exists(): + print(f"❌ Resume checkpoint file not found: {args.resume}") + return 1 + + # Simple resume: just read 
completed problems list + print(f"📂 Resuming from checkpoint: {args.resume}") + with open(args.resume) as f: + checkpoint_data = json.load(f) + + # Extract completed problem indices + completed_indices = checkpoint_data.get('completed_indices', []) + print(f" Found {len(completed_indices)} completed problems to skip") + + # Still need dataset path for resume + if not args.dataset_path: + # Try to get from checkpoint for convenience + dataset_path = checkpoint_data.get('dataset_path') + if dataset_path: + dataset_path = Path(dataset_path) + print(f" Using dataset path from checkpoint: {dataset_path}") + else: + print("❌ Dataset path is required when resuming") + return 1 + else: + dataset_path = Path(args.dataset_path) + else: + # New evaluation + if not args.dataset_path: + print("❌ Dataset path is required for new batch evaluation.") + return 1 + dataset_path = Path(args.dataset_path) + if not dataset_path.exists(): + print(f"❌ Dataset path not found: {dataset_path}") + return 1 + + try: + # Import batch evaluation functions + from scripts.batch_evaluate import batch_evaluate, batch_evaluate_cross + + # Check if we need to run all variants + if args.variant == "all" and not args.resume: + # All available variants + all_variants = ["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"] + + print(f"🔄 Running all {len(all_variants)} variants sequentially...") + + overall_results = [] + for i, variant in enumerate(all_variants, 1): + print(f"\n{'='*60}") + print(f"📍 Variant {i}/{len(all_variants)}: {variant}") + print(f"{'='*60}") + + # Determine output file for this variant + if args.output: + # If output specified, append variant name + output_path = Path(args.output) + output_file = output_path.parent / f"{output_path.stem}_{variant}{output_path.suffix}" + else: + output_file = None + + # Run batch evaluation for this variant + if hasattr(args, 'solver_provider') and args.solver_provider: + 
# Cross-provider batch evaluation + results = await batch_evaluate_cross( + dataset_path=dataset_path, + solver_provider=args.solver_provider, + grader_provider=args.grader_provider if hasattr(args, 'grader_provider') else args.solver_provider, + variant_type=variant, + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=output_file, + resume_checkpoint=args.resume, + vllm_url=args.vllm_url if hasattr(args, 'vllm_url') else None, + device=args.device if hasattr(args, 'device') else None, + quick=args.quick if hasattr(args, 'quick') else False + ) + else: + # Standard batch evaluation + loader_kwargs = {} + provider = args.provider or "openai" + if provider == 'vllm' and hasattr(args, 'vllm_url'): + loader_kwargs['base_url'] = args.vllm_url + elif provider == 'huggingface' and hasattr(args, 'device'): + loader_kwargs['device'] = args.device + + # Add quick mode if specified + if hasattr(args, 'quick') and args.quick: + loader_kwargs['quick'] = True + + results = await batch_evaluate( + dataset_path=dataset_path, + provider=provider, + variant_type=variant, + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=output_file, + resume_checkpoint=args.resume, + **loader_kwargs + ) + + print(f"✅ {variant} completed!") + print(f"📊 Average grade: {results['summary']['average_grade']:.2f}") + print(f"📈 Success rate: {results['summary']['success_rate']:.1f}%") + + overall_results.append({ + 'variant': variant, + 'summary': results['summary'] + }) + + # Wait between variants to ensure clean state + if i < len(all_variants): + print("\n⏳ Waiting 5 seconds before next variant...") + await asyncio.sleep(5) + + # Print overall summary + print(f"\n{'='*60}") + print("📊 OVERALL SUMMARY") + print(f"{'='*60}") + + for result in overall_results: + variant = result['variant'] + summary = 
result['summary'] + print(f"{variant:20s}: Grade {summary['average_grade']:5.2f}, Success {summary['success_rate']:5.1f}%") + + return 0 + else: + # Single variant evaluation + if hasattr(args, 'solver_provider') and args.solver_provider: + # Cross-provider batch evaluation + results = await batch_evaluate_cross( + dataset_path=dataset_path, + solver_provider=args.solver_provider, + grader_provider=args.grader_provider if hasattr(args, 'grader_provider') else args.solver_provider, + variant_type=args.variant or "original", + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=Path(args.output) if args.output else None, + resume_checkpoint=args.resume, + vllm_url=args.vllm_url if hasattr(args, 'vllm_url') else None, + device=args.device if hasattr(args, 'device') else None, + quick=args.quick if hasattr(args, 'quick') else False + ) + else: + # Standard batch evaluation + loader_kwargs = {} + provider = args.provider or "openai" + if provider == 'vllm' and hasattr(args, 'vllm_url'): + loader_kwargs['base_url'] = args.vllm_url + elif provider == 'huggingface' and hasattr(args, 'device'): + loader_kwargs['device'] = args.device + + # Add quick mode if specified + if hasattr(args, 'quick') and args.quick: + loader_kwargs['quick'] = True + + results = await batch_evaluate( + dataset_path=dataset_path, + provider=provider, + variant_type=args.variant or "original", + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=Path(args.output) if args.output else None, + resume_checkpoint=args.resume, + **loader_kwargs + ) + + print(f"✅ Batch evaluation completed!") + print(f"📊 Average grade: {results['summary']['average_grade']:.2f}") + print(f"📈 Success rate: {results['summary']['success_rate']:.1f}%") + + return 0 + + except ImportError: + print("❌ Batch evaluation module not available") + 
return 1 + except Exception as e: + print(f"❌ Batch evaluation failed: {str(e)}") + return 1 + + async def cmd_multi_test(self, args) -> int: + """Run multi-variant testing.""" + self.print_banner() + + provider = args.provider or "openai" + print(f"🎯 Multi-variant testing with {provider}") + + try: + from scripts.batch_evaluate import batch_evaluate_all_variants + + # Run multi-variant evaluation + results = await batch_evaluate_all_variants( + dataset_path=Path(args.dataset_path or "dataset"), + provider=provider, + variants=args.variants, + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_dir=Path(args.output_dir or "multi_variant_results"), + base_url=args.vllm_url if provider == 'vllm' else None, + device=args.device if provider == 'huggingface' else None + ) + + print(f"✅ Multi-variant testing completed!") + metrics = results['aggregate_metrics'] + print(f"📊 Overall average grade: {metrics['overall_average_grade']:.2f}") + print(f"📈 Overall success rate: {metrics['overall_success_rate']:.1f}%") + print(f"⏱️ Total time: {results['test_overview']['total_test_time_minutes']:.1f} minutes") + + comparison = results['variant_comparison'] + if comparison['best_performing_variant']['variant']: + print(f"🏆 Best variant: {comparison['best_performing_variant']['variant']} " + f"(Grade: {comparison['best_performing_variant']['grade']:.2f})") + + return 0 + + except ImportError: + print("❌ Multi-variant testing module not available") + return 1 + except Exception as e: + print(f"❌ Multi-variant testing failed: {str(e)}") + return 1 + + async def cmd_info(self, args) -> int: + """Show system information.""" + self.print_banner() + + print("ℹ️ System Information") + print("-" * 30) + + # Check environment variables + print("🔧 Environment Variables:") + env_vars = [ + 'OPENAI_API_KEY', + 'ANTHROPIC_API_KEY', + 'GOOGLE_API_KEY', + 'XAI_API_KEY', + 'MOONSHOT_API_KEY' + ] + for var in 
env_vars: + value = os.getenv(var) + status = "✅ Set" if value else "❌ Not set" + provider = var.replace('_API_KEY', '').replace('MOONSHOT', 'KIMI') + print(f" {provider}: {status}") + + print() + self.print_providers() + + # Show usage examples + print("💡 Quick Start Examples:") + print(" # Single provider:") + print(" putnam solve dataset/1938-A-1.json") + print(" putnam test --provider openai") + print(" putnam batch dataset/ --provider anthropic --max-files 5") + print("") + print(" # Cross-provider:") + print(" putnam solve dataset/1938-A-1.json --solver-provider kimi --grader-provider openai") + print(" putnam batch dataset/ --solver-provider kimi --grader-provider openai --concurrent 200") + print("") + print(" # Full test with all variants:") + print(" putnam batch dataset/ --variant all --solver-provider kimi --grader-provider openai") + print("") + print(" # Resume functionality:") + print(" putnam batch --resume checkpoint_file.json") + print(" putnam batch dataset/ --provider openai --resume old_checkpoint_file.json") + + return 0 + + +def create_parser(): + """Create argument parser.""" + parser = argparse.ArgumentParser( + description="Putnam Mathematical Problem Solver CLI", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + Single provider: + putnam solve problem.json --provider openai + putnam test --provider anthropic + putnam batch dataset/ --provider gemini --max-files 10 + + Cross-provider: + putnam solve problem.json --solver-provider kimi --grader-provider openai + putnam batch dataset/ --solver-provider kimi --grader-provider openai --concurrent 200 + putnam batch dataset/ --variant all --solver-provider kimi --grader-provider openai + + Resume functionality: + putnam batch --resume checkpoint_file.json + putnam batch dataset/ --provider openai --resume old_checkpoint_file.json + """ + ) + + # Global options + parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output") + 
parser.add_argument("--debug", action="store_true", help="Enable debug mode (show JSON parsing details)") + + # Subcommands + subparsers = parser.add_subparsers(dest="command", help="Available commands") + + # Solve command + solve_parser = subparsers.add_parser("solve", help="Solve a single problem") + solve_parser.add_argument("problem_file", type=Path, help="Problem JSON file") + solve_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + solve_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + solve_parser.add_argument("--grader-provider", choices=get_supported_providers(), + + help="Provider for grading") + solve_parser.add_argument("--variant", choices=["original", "descriptive_long", "kernel_variant"], + help="Problem variant") + solve_parser.add_argument("--solver-model", help="Override solver model") + solve_parser.add_argument("--grader-model", help="Override grader model") + solve_parser.add_argument("--output", "-o", type=Path, help="Save results to file") + solve_parser.add_argument("--debug", action="store_true", help="Enable debug mode (show JSON parsing details)") + solve_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", + help="VLLM server URL") + solve_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], + help="Device for HuggingFace") + + # Test command + test_parser = subparsers.add_parser("test", help="Quick test of a provider") + test_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + test_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + test_parser.add_argument("--grader-provider", choices=get_supported_providers(), + help="Provider for grading") + test_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", help="VLLM server URL") + 
test_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], help="Device for HuggingFace") + + # Health command + health_parser = subparsers.add_parser("health", help="Check provider health") + health_parser.add_argument("--provider", choices=get_supported_providers(), + help="Check specific provider only") + health_parser.add_argument("--detailed", action="store_true", help="Detailed health check") + + # Benchmark command + benchmark_parser = subparsers.add_parser("benchmark", help="Run performance benchmark") + benchmark_parser.add_argument("--quick", action="store_true", help="Quick benchmark") + benchmark_parser.add_argument("--config", type=Path, help="Configuration file") + + # Batch command + batch_parser = subparsers.add_parser("batch", help="Batch evaluation") + batch_parser.add_argument("dataset_path", type=Path, nargs='?', help="Dataset directory (required for new runs, optional for resume)") + batch_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + batch_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + batch_parser.add_argument("--grader-provider", choices=get_supported_providers(), + help="Provider for grading") + batch_parser.add_argument("--variant", choices=["all", "original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Problem variant (use 'all' to run all variants sequentially)") + batch_parser.add_argument("--max-files", type=int, help="Maximum files to process") + batch_parser.add_argument("--concurrent", type=int, default=3, help="Concurrent evaluations") + batch_parser.add_argument("--solver-model", help="Override solver model") + batch_parser.add_argument("--grader-model", help="Override grader model") + batch_parser.add_argument("--output", "-o", help="Output file") + batch_parser.add_argument("--resume", type=Path, 
help="Resume from checkpoint file") + batch_parser.add_argument("--debug", action="store_true", help="Enable debug mode (show JSON parsing details)") + batch_parser.add_argument("--quick", action="store_true", help="Quick mode: allows one retry with 1200s timeout per attempt") + batch_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", help="VLLM server URL") + batch_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], help="Device for HuggingFace") + + # Multi-test command + multi_parser = subparsers.add_parser("multi-test", help="Run multi-variant testing") + multi_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + multi_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + multi_parser.add_argument("--grader-provider", choices=get_supported_providers(), + help="Provider for grading") + multi_parser.add_argument("--dataset-path", type=Path, help="Dataset directory path") + multi_parser.add_argument("--variants", nargs="+", + choices=["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Specific variants to test (default: all)") + multi_parser.add_argument("--max-files", type=int, help="Maximum files per variant") + multi_parser.add_argument("--concurrent", type=int, help="Maximum concurrent evaluations") + multi_parser.add_argument("--solver-model", help="Override solver model") + multi_parser.add_argument("--grader-model", help="Override grader model") + multi_parser.add_argument("--output-dir", type=Path, help="Output directory") + multi_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", help="VLLM server URL") + multi_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], help="Device for HuggingFace") + + # Info command + info_parser = subparsers.add_parser("info", help="Show system 
information") + + return parser + + +async def main(): + """Main CLI entry point.""" + parser = create_parser() + args = parser.parse_args() + + # Handle no command + if not args.command: + parser.print_help() + return 1 + + # Create CLI instance + cli = PutnamCLI() + cli.verbose = args.verbose + + # Route to appropriate command + try: + if args.command == "solve": + return await cli.cmd_solve(args) + elif args.command == "test": + return await cli.cmd_test(args) + elif args.command == "health": + return await cli.cmd_health(args) + elif args.command == "benchmark": + return await cli.cmd_benchmark(args) + elif args.command == "batch": + return await cli.cmd_batch(args) + elif args.command == "multi-test": + return await cli.cmd_multi_test(args) + elif args.command == "info": + return await cli.cmd_info(args) + else: + print(f"❌ Unknown command: {args.command}") + return 1 + + except KeyboardInterrupt: + print("\n⏸️ Operation interrupted by user") + return 1 + except Exception as e: + print(f"❌ Error: {str(e)}") + if cli.verbose: + import traceback + traceback.print_exc() + return 1 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
\ No newline at end of file diff --git a/putnam-bench-anon/requirements-local.txt b/putnam-bench-anon/requirements-local.txt new file mode 100644 index 0000000..fbcab3f --- /dev/null +++ b/putnam-bench-anon/requirements-local.txt @@ -0,0 +1,27 @@ +# Requirements for local model support (GPU recommended) +# Base requirements +-r requirements-minimal.txt + +# PyTorch with CUDA support (install separately with specific CUDA version) +# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 + +# HuggingFace transformers ecosystem +transformers>=4.30.0 +accelerate>=0.20.0 +tokenizers>=0.13.0 + +# Optional: VLLM for high-performance inference +vllm>=0.3.0 + +# GPU monitoring and management +nvidia-ml-py3>=11.0.0 + +# Model quantization and optimization +bitsandbytes>=0.39.0 + +# Additional utilities for local models +safetensors>=0.3.0 +huggingface-hub>=0.16.0 + +# Progress bars +tqdm>=4.60.0
\ No newline at end of file diff --git a/putnam-bench-anon/requirements-minimal.txt b/putnam-bench-anon/requirements-minimal.txt new file mode 100644 index 0000000..84b34a9 --- /dev/null +++ b/putnam-bench-anon/requirements-minimal.txt @@ -0,0 +1,16 @@ +# Minimal requirements for basic functionality +# Core AI provider clients +openai>=1.0.0 +anthropic>=0.25.0 +google-generativeai>=0.3.0 + +# System utilities +psutil>=5.9.0 +six>=1.16.0 + +# Basic async support +aiohttp>=3.8.0 +requests>=2.25.0 + +# Progress bars +tqdm>=4.60.0
\ No newline at end of file diff --git a/putnam-bench-anon/requirements.txt b/putnam-bench-anon/requirements.txt new file mode 100644 index 0000000..d36dc4a --- /dev/null +++ b/putnam-bench-anon/requirements.txt @@ -0,0 +1,35 @@ +# Core AI provider clients +openai>=1.0.0 +anthropic>=0.25.0 +google-generativeai>=0.3.0 + +# Local model support +transformers>=4.30.0 +torch>=2.0.0 +accelerate>=0.20.0 + +# Optional: VLLM for local model serving +vllm>=0.3.0 + +# System utilities +psutil>=5.9.0 +six>=1.16.0 + +# Development and utility libraries +asyncio-throttle>=1.0.0 +aiohttp>=3.8.0 +requests>=2.25.0 + +# Data handling +numpy>=1.21.0 +pandas>=1.3.0 + +# JSON and data processing +jsonschema>=4.0.0 + +# Math and scientific computing +sympy>=1.9.0 +matplotlib>=3.5.0 + +# Progress bars and UI +tqdm>=4.60.0
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/__init__.py b/putnam-bench-anon/scripts/__init__.py new file mode 100644 index 0000000..389f811 --- /dev/null +++ b/putnam-bench-anon/scripts/__init__.py @@ -0,0 +1 @@ +"""Scripts package for Putnam mathematical problem solver."""
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/batch_evaluate.py b/putnam-bench-anon/scripts/batch_evaluate.py new file mode 100644 index 0000000..6fde90b --- /dev/null +++ b/putnam-bench-anon/scripts/batch_evaluate.py @@ -0,0 +1,1211 @@ +#!/usr/bin/env python3 +""" +Batch evaluation script for processing entire datasets with multiple providers. + +This script efficiently processes all JSON files in the dataset directory, +supports multiple AI providers, and generates comprehensive evaluation reports. + +Features: +- Incremental saving: Results are saved after each problem completes +- Simple resume support: Skip already completed problems based on checkpoint +- Multi-provider support +- Comprehensive evaluation reports + +Usage: + python batch_evaluate.py --provider openai --output results/openai_results.json + python batch_evaluate.py --provider anthropic --variant kernel_variant --max-concurrent 5 + +Resume usage (simplified): + # Resume with same configuration + python batch_evaluate.py --provider openai --dataset dataset/ --resume checkpoint_file.json + + # Resume with different settings (checkpoint only provides skip list) + python batch_evaluate.py --provider openai --dataset dataset/ --concurrent 10 --resume checkpoint_file.json +""" + +import asyncio +import json +import sys +import time +from pathlib import Path +import argparse +from typing import List, Dict, Any +import logging +from datetime import datetime +import shutil + +try: + from tqdm import tqdm + HAS_TQDM = True +except ImportError: + HAS_TQDM = False + # Fallback progress bar + class tqdm: + def __init__(self, total=None, desc=None, **kwargs): + self.total = total + self.n = 0 + self.desc = desc + print(f"{desc}: Starting...") + + def update(self, n=1): + self.n += n + if self.total: + percent = (self.n / self.total) * 100 + print(f"{self.desc}: {self.n}/{self.total} ({percent:.1f}%)", end='\r') + + def set_postfix(self, postfix_dict): + pass + + def close(self): + print() 
# New line after progress + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import create_loader, get_supported_providers + + +def setup_logging(output_dir: Path): + """Setup logging configuration.""" + log_file = output_dir / f"evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log" + + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler(log_file), + logging.StreamHandler(sys.stdout) + ] + ) + + return logging.getLogger(__name__) + + +async def load_dataset(dataset_path: Path, max_files: int = None) -> List[Dict[str, Any]]: + """Load all JSON files from the dataset directory.""" + json_files = list(dataset_path.glob("*.json")) + + if max_files: + json_files = json_files[:max_files] + + problems = [] + for json_file in json_files: + try: + with open(json_file, 'r', encoding='utf-8') as f: + data = json.load(f) + data['_source_file'] = str(json_file.name) + problems.append(data) + except Exception as e: + logging.warning(f"Failed to load {json_file}: {str(e)}") + + return problems + + +async def process_single_problem(loader, problem_data: Dict[str, Any], + variant_type: str, solver_model: str = None, + grader_model: str = None) -> Dict[str, Any]: + """Process a single problem and return results with metadata.""" + start_time = time.time() + + try: + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type, + solver_model=solver_model, + grader_model=grader_model + ) + + # Add metadata + result['_metadata'] = { + 'source_file': problem_data.get('_source_file', 'unknown'), + 'variant_type': variant_type, + 'processing_time': time.time() - start_time, + 'timestamp': datetime.now().isoformat(), + 'models_used': { + 'solver': solver_model or loader.solver_model, + 'grader': grader_model or loader.grader_model + } + } + + return result + + except Exception as e: + # Return error information + return { + 'error': 
str(e), + 'final_grade': 0, + '_metadata': { + 'source_file': problem_data.get('_source_file', 'unknown'), + 'variant_type': variant_type, + 'processing_time': time.time() - start_time, + 'timestamp': datetime.now().isoformat(), + 'error': True + } + } + + +async def batch_evaluate(dataset_path: Path = None, provider: str = None, variant_type: str = "original", + max_concurrent: int = 3, max_files: int = None, + solver_model: str = None, grader_model: str = None, + output_file: Path = None, resume_checkpoint: Path = None, + **loader_kwargs) -> Dict[str, Any]: + """ + Batch evaluate problems using specified provider with resume support. + + Args: + dataset_path: Path to dataset directory (required for new runs or old checkpoint format) + provider: AI provider name (required for new runs or old checkpoint format) + variant_type: Problem variant to use + max_concurrent: Maximum concurrent evaluations + max_files: Maximum number of files to process (None for all) + solver_model: Override solver model + grader_model: Override grader model + output_file: Output file path + resume_checkpoint: Path to checkpoint file to resume from + **loader_kwargs: Additional arguments for loader + + Returns: + Dictionary with evaluation results and statistics + """ + logger = logging.getLogger(__name__) + + # Check if resuming from checkpoint + if resume_checkpoint and resume_checkpoint.exists(): + logger.info(f"Resuming from checkpoint: {resume_checkpoint}") + with open(resume_checkpoint, 'r', encoding='utf-8') as f: + checkpoint_data = json.load(f) + + # Simple resume: just restore completed indices and results + completed_indices = set(checkpoint_data.get('completed_indices', [])) + results = checkpoint_data.get('results', []) + failed_indices = checkpoint_data.get('failed_indices', []) + successful_indices = checkpoint_data.get('successful_indices', []) + correct_indices = checkpoint_data.get('correct_indices', []) + + # Always require dataset_path and provider from command line + 
if not dataset_path: + raise ValueError("dataset_path is required when resuming") + if not provider: + raise ValueError("provider is required when resuming") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + checkpoint_file = resume_checkpoint # Continue using the same checkpoint file + logger.info(f"Resuming with {len(completed_indices)} completed problems out of {len(problems)}") + else: + # New evaluation - validate required parameters + if not dataset_path: + raise ValueError("dataset_path is required for new evaluation") + if not provider: + raise ValueError("provider is required for new evaluation") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + # Initialize state for new run + completed_indices = set() + results = [] + failed_indices = [] + successful_indices = [] + correct_indices = [] + + # Create checkpoint file name + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + if output_file: + checkpoint_file = output_file.parent / f"checkpoint_{output_file.stem}_{timestamp}.json" + else: + checkpoint_file = Path(f"checkpoint_{provider}_{variant_type}_{timestamp}.json") + + # Create loader + logger.info(f"Creating {provider} loader") + + # Include solver_model and grader_model in loader_kwargs if specified + if solver_model: + loader_kwargs['solver_model'] = solver_model + if grader_model: + loader_kwargs['grader_model'] = grader_model + + loader = create_loader(provider, **loader_kwargs) + + # Health check + logger.info("Performing health check...") + if not await loader.health_check(): + raise RuntimeError(f"Health check failed for {provider}") + + 
# Cost estimation + logger.info("Estimating costs...") + cost_info = await loader.estimate_cost(len(problems)) + logger.info(f"Estimated cost: ${cost_info.get('total_cost', 0):.2f}") + + # Progress tracking + remaining_problems = [p for p in problems if p.get('index', 'unknown') not in completed_indices] + progress_bar = tqdm(total=len(problems), desc=f"Evaluating with {provider}", initial=len(completed_indices)) + + # Semaphore for concurrency control + semaphore = asyncio.Semaphore(max_concurrent) + + def save_checkpoint(): + """Save current state to checkpoint file - simplified version""" + checkpoint_data = { + 'timestamp': datetime.now().isoformat(), + # Only save essential state information + 'completed_indices': list(completed_indices), + 'successful_indices': successful_indices, + 'failed_indices': failed_indices, + 'correct_indices': correct_indices, + 'results': results, + # Save minimal config for reference (not for resume) + 'dataset_path': str(dataset_path), # For convenience + 'total_problems': len(problems), + 'current_config': { + 'provider': provider, + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model + } + } + + # Write to temporary file first, then move (atomic operation) + temp_file = checkpoint_file.with_suffix('.tmp') + with open(temp_file, 'w', encoding='utf-8') as f: + json.dump(checkpoint_data, f, indent=2, ensure_ascii=False) + + # Atomic rename + temp_file.replace(checkpoint_file) + + async def evaluate_problem(problem_data: Dict[str, Any]) -> Dict[str, Any]: + """Evaluate a single problem with concurrency control.""" + problem_index = problem_data.get('index', 'unknown') + + # Skip if already completed + if problem_index in completed_indices: + return None + + async with semaphore: + try: + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type + ) + + # Track success/failure based on technical completion, not correctness + if result.get('status') 
== 'completed': + successful_indices.append(result['index']) # Successfully processed + if result.get('correct'): + correct_indices.append(result['index']) # Also correct + else: + failed_indices.append(result['index']) # Technical failure + + # Add to results and mark as completed + results.append(result) + completed_indices.add(problem_index) + + # Save checkpoint immediately after each problem + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + except Exception as e: + logger.error(f"Error evaluating problem {problem_index}: {e}") + result = { + 'index': problem_index, + 'status': 'error', + 'error': str(e), + 'error_type': type(e).__name__ + } + + # Add to results and mark as completed (even if failed) + results.append(result) + failed_indices.append(problem_index) + completed_indices.add(problem_index) + + # Save checkpoint + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + # Run evaluations + start_time = time.time() + + try: + # Create tasks only for remaining problems + tasks = [evaluate_problem(problem) for problem in remaining_problems] + + if tasks: + # Execute all tasks concurrently (limited by semaphore) + await asyncio.gather(*tasks) + else: + logger.info("All problems already completed!") + + except KeyboardInterrupt: + logger.info("Evaluation interrupted by user. 
Progress saved to checkpoint.") + logger.info(f"To resume, use: --resume {checkpoint_file}") + raise + + finally: + progress_bar.close() + + # Calculate statistics + total_time = time.time() - start_time + completed_results = [r for r in results if r.get('status') == 'completed'] + grades = [r['grade']['grade'] for r in completed_results + if r.get('grade', {}).get('status') == 'success' and 'grade' in r.get('grade', {})] + + # Calculate numeric grades (CORRECT=5, INCORRECT=2.5) + numeric_grades = [5.0 if g == 'CORRECT' else 2.5 for g in grades] + average_grade = sum(numeric_grades) / len(numeric_grades) if numeric_grades else 0.0 + + summary = { + 'total_problems': len(problems), + 'completed': len(completed_results), + 'successful': len(successful_indices), # Technical success (completed processing) + 'failed': len(failed_indices), # Technical failures + 'correct_answers': len(correct_indices), # Mathematically correct answers + 'incorrect_answers': len(successful_indices) - len(correct_indices), # Wrong but processed + 'success_rate': (len(successful_indices) / len(problems) * 100) if problems else 0, # Technical success rate + 'accuracy_rate': (len(correct_indices) / len(successful_indices) * 100) if successful_indices else 0, # Correctness rate + 'average_grade': average_grade, + 'total_time_seconds': total_time, + 'problems_per_second': len(problems) / total_time if total_time > 0 else 0, + 'provider': provider, + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model, + 'max_concurrent': max_concurrent, + 'estimated_cost': cost_info, + 'checkpoint_file': str(checkpoint_file) + } + + # Create full results + full_results = { + 'summary': summary, + 'problems': results, + 'successful_indices': successful_indices, # Technical successes + 'failed_indices': failed_indices, # Technical failures + 'correct_indices': correct_indices, # Correct answers + 'timestamp': datetime.now().isoformat() + } + + # Save final 
results + if output_file: + logger.info(f"Saving final results to {output_file}") + with open(output_file, 'w', encoding='utf-8') as f: + json.dump(full_results, f, indent=2, ensure_ascii=False) + + # Clean up checkpoint file after successful completion + if checkpoint_file.exists(): + logger.info(f"Removing checkpoint file: {checkpoint_file}") + checkpoint_file.unlink() + + # Print summary + logger.info(f"\n{'='*60}") + logger.info("EVALUATION SUMMARY") + logger.info(f"{'='*60}") + logger.info(f"Provider: {provider}") + logger.info(f"Variant: {variant_type}") + logger.info(f"Total problems: {summary['total_problems']}") + logger.info(f"✅ Successfully processed: {summary['successful']} ({summary['success_rate']:.1f}%)") + logger.info(f"💥 Technical failures: {summary['failed']}") + logger.info(f"🎯 Correct answers: {summary['correct_answers']} ({summary['accuracy_rate']:.1f}% of processed)") + logger.info(f"❌ Wrong answers: {summary['incorrect_answers']}") + logger.info(f"Average grade: {summary['average_grade']:.2f}") + logger.info(f"Total time: {summary['total_time_seconds']:.1f}s") + logger.info(f"Speed: {summary['problems_per_second']:.2f} problems/second") + + # Cleanup + if hasattr(loader, '__aexit__'): + await loader.__aexit__(None, None, None) + + return full_results + + +async def batch_evaluate_cross(dataset_path: Path = None, + solver_provider: str = None, + grader_provider: str = None, + variant_type: str = "original", + max_concurrent: int = 3, + max_files: int = None, + solver_model: str = None, + grader_model: str = None, + output_file: Path = None, + resume_checkpoint: Path = None, + vllm_url: str = None, + device: str = None, + quick: bool = False) -> Dict[str, Any]: + """ + Batch evaluate problems using different providers for solving and grading with resume support. 
+ + Args: + dataset_path: Path to dataset directory (required for new runs, ignored for resume) + solver_provider: Provider for solving problems (required for new runs, ignored for resume) + grader_provider: Provider for grading (if None, uses solver_provider) + variant_type: Problem variant to use + max_concurrent: Maximum concurrent evaluations + max_files: Maximum number of files to process (None for all) + solver_model: Override solver model + grader_model: Override grader model + output_file: Output file path + resume_checkpoint: Path to checkpoint file to resume from + vllm_url: VLLM server URL if using VLLM + device: Device for HuggingFace models + + Returns: + Dictionary with evaluation results and statistics + """ + logger = logging.getLogger(__name__) + + # Check if resuming from checkpoint + if resume_checkpoint and resume_checkpoint.exists(): + logger.info(f"Resuming from checkpoint: {resume_checkpoint}") + with open(resume_checkpoint, 'r', encoding='utf-8') as f: + checkpoint_data = json.load(f) + + # Simple resume: just restore completed indices and results + completed_indices = set(checkpoint_data.get('completed_indices', [])) + results = checkpoint_data.get('results', []) + failed_indices = checkpoint_data.get('failed_indices', []) + successful_indices = checkpoint_data.get('successful_indices', []) + correct_indices = checkpoint_data.get('correct_indices', []) + + # Always require providers and dataset_path from command line + if not dataset_path: + raise ValueError("dataset_path is required when resuming") + if not solver_provider: + raise ValueError("solver_provider is required when resuming") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + checkpoint_file = resume_checkpoint # Continue using the same checkpoint file + logger.info(f"Resuming 
with {len(completed_indices)} completed problems out of {len(problems)}") + else: + # New evaluation - validate required parameters + if not dataset_path: + raise ValueError("dataset_path is required for new evaluation") + if not solver_provider: + raise ValueError("solver_provider is required for new evaluation") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + # Initialize state for new run + completed_indices = set() + results = [] + failed_indices = [] + successful_indices = [] + correct_indices = [] + + # Create checkpoint file name + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + if output_file: + checkpoint_file = output_file.parent / f"checkpoint_{output_file.stem}_{timestamp}.json" + else: + checkpoint_file = Path(f"checkpoint_cross_{solver_provider}_{grader_provider or solver_provider}_{variant_type}_{timestamp}.json") + + # Create cross-provider loader + logger.info(f"Creating cross-provider loader: solver={solver_provider}, grader={grader_provider or solver_provider}") + + from loader import create_cross_provider_loader + + # Prepare kwargs for each provider + loader_kwargs = {} + + # VLLM settings + if vllm_url: + if solver_provider == 'vllm': + loader_kwargs['solver_kwargs'] = {'base_url': vllm_url} + if grader_provider == 'vllm': + loader_kwargs['grader_kwargs'] = {'base_url': vllm_url} + + # HuggingFace settings + if device: + if solver_provider == 'huggingface': + loader_kwargs['solver_kwargs'] = {'device': device} + if grader_provider == 'huggingface': + loader_kwargs['grader_kwargs'] = {'device': device} + + # Add quick mode if specified + if quick: + loader_kwargs['quick'] = True + + loader = create_cross_provider_loader( + solver_provider=solver_provider, + grader_provider=grader_provider, + solver_model=solver_model, + 
grader_model=grader_model, + **loader_kwargs + ) + + # Health check + logger.info("Performing health check...") + if not await loader.health_check(): + raise RuntimeError(f"Health check failed") + + # Cost estimation + logger.info("Estimating costs...") + cost_info = await loader.estimate_cost(len(problems)) + logger.info(f"Estimated cost: ${cost_info.get('total_cost', 0):.2f}") + + # Progress tracking + remaining_problems = [p for p in problems if p.get('index', 'unknown') not in completed_indices] + progress_bar = tqdm(total=len(problems), desc=f"Evaluating (solver={solver_provider}, grader={grader_provider or solver_provider})", initial=len(completed_indices)) + + # Semaphore for concurrency control + semaphore = asyncio.Semaphore(max_concurrent) + + def save_checkpoint(): + """Save current state to checkpoint file - simplified version""" + checkpoint_data = { + 'timestamp': datetime.now().isoformat(), + # Only save essential state information + 'completed_indices': list(completed_indices), + 'successful_indices': successful_indices, + 'failed_indices': failed_indices, + 'correct_indices': correct_indices, + 'results': results, + # Save minimal config for reference (not for resume) + 'dataset_path': str(dataset_path), # For convenience + 'total_problems': len(problems), + 'current_config': { + 'solver_provider': solver_provider, + 'grader_provider': grader_provider or solver_provider, + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model + } + } + + # Write to temporary file first, then move (atomic operation) + temp_file = checkpoint_file.with_suffix('.tmp') + with open(temp_file, 'w', encoding='utf-8') as f: + json.dump(checkpoint_data, f, indent=2, ensure_ascii=False) + + # Atomic rename + temp_file.replace(checkpoint_file) + + async def evaluate_problem(problem_data: Dict[str, Any]) -> Dict[str, Any]: + """Evaluate a single problem with concurrency control.""" + problem_index = problem_data.get('index', 
'unknown') + + # Skip if already completed + if problem_index in completed_indices: + return None + + async with semaphore: + try: + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type + ) + + # Track success/failure based on technical completion, not correctness + if result.get('status') == 'completed': + successful_indices.append(result['index']) # Successfully processed + if result.get('correct'): + correct_indices.append(result['index']) # Also correct + else: + failed_indices.append(result['index']) # Technical failure + + # Add to results and mark as completed + results.append(result) + completed_indices.add(problem_index) + + # Save checkpoint immediately after each problem + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + except Exception as e: + import traceback + + # Capture full error details + error_details = { + 'error_message': str(e), + 'error_type': type(e).__name__, + 'traceback': traceback.format_exc(), + 'timestamp': datetime.now().isoformat(), + 'problem_index': problem_index, + 'problem_title': problem_data.get('title', 'unknown') + } + + # Try to capture HTTP-specific details if available + if hasattr(e, 'response'): + try: + error_details['http_status'] = e.response.status_code + error_details['http_headers'] = dict(e.response.headers) + error_details['http_response_text'] = e.response.text + except: + pass + + # Try to capture request details if available + if hasattr(e, 'request'): + try: + error_details['request_method'] = e.request.method + error_details['request_url'] = e.request.url + error_details['request_headers'] = dict(e.request.headers) + # Don't log request body as it might contain sensitive info + except: + pass + + # Log detailed error + logger.error(f"DETAILED ERROR for problem {problem_index}:") + logger.error(f" Error Type: 
{error_details['error_type']}") + logger.error(f" Error Message: {error_details['error_message']}") + logger.error(f" Problem Title: {error_details['problem_title']}") + + if 'http_status' in error_details: + logger.error(f" HTTP Status: {error_details['http_status']}") + logger.error(f" HTTP Response: {error_details['http_response_text'][:500]}...") + + logger.error(f" Full Traceback:\n{error_details['traceback']}") + + # Save to detailed error log + error_log_file = output_file.parent / f"detailed_errors_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json" if output_file else Path(f"detailed_errors_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json") + + try: + # Load existing errors if file exists + if error_log_file.exists(): + with open(error_log_file, 'r') as f: + existing_errors = json.load(f) + else: + existing_errors = [] + + # Add new error + existing_errors.append(error_details) + + # Save updated errors + with open(error_log_file, 'w') as f: + json.dump(existing_errors, f, indent=2, ensure_ascii=False) + + logger.info(f"Detailed error saved to {error_log_file}") + + except Exception as save_error: + logger.error(f"Failed to save detailed error log: {save_error}") + + result = { + 'index': problem_index, + 'status': 'error', + 'error': str(e), + 'error_type': type(e).__name__, + 'error_details': error_details + } + + # Add to results and mark as completed (even if failed) + results.append(result) + failed_indices.append(problem_index) + completed_indices.add(problem_index) + + # Save checkpoint + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + # Run evaluations + start_time = time.time() + + try: + # Create tasks only for remaining problems + tasks = [evaluate_problem(problem) for problem in remaining_problems] + + if tasks: + # Execute all tasks concurrently (limited by semaphore) + await 
asyncio.gather(*tasks) + else: + logger.info("All problems already completed!") + + except KeyboardInterrupt: + logger.info("Evaluation interrupted by user. Progress saved to checkpoint.") + logger.info(f"To resume, use: --resume {checkpoint_file}") + raise + + finally: + progress_bar.close() + + # Calculate statistics + total_time = time.time() - start_time + completed_results = [r for r in results if r.get('status') == 'completed'] + grades = [r['grade']['grade'] for r in completed_results + if r.get('grade', {}).get('status') == 'success' and 'grade' in r.get('grade', {})] + + # Calculate numeric grades (CORRECT=5, INCORRECT=2.5) + numeric_grades = [5.0 if g == 'CORRECT' else 2.5 for g in grades] + average_grade = sum(numeric_grades) / len(numeric_grades) if numeric_grades else 0.0 + + model_info = loader.get_model_info() + + summary = { + 'total_problems': len(problems), + 'completed': len(completed_results), + 'successful': len(successful_indices), # Technical success (completed processing) + 'failed': len(failed_indices), # Technical failures + 'correct_answers': len(correct_indices), # Mathematically correct answers + 'incorrect_answers': len(successful_indices) - len(correct_indices), # Wrong but processed + 'success_rate': (len(successful_indices) / len(problems) * 100) if problems else 0, # Technical success rate + 'accuracy_rate': (len(correct_indices) / len(successful_indices) * 100) if successful_indices else 0, # Correctness rate + 'average_grade': average_grade, + 'total_time_seconds': total_time, + 'problems_per_second': len(problems) / total_time if total_time > 0 else 0, + 'solver_provider': model_info.get('solver_provider', solver_provider), + 'grader_provider': model_info.get('grader_provider', grader_provider or solver_provider), + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model, + 'max_concurrent': max_concurrent, + 'estimated_cost': cost_info, + 'is_cross_provider': 
model_info.get('is_cross_provider', False) + } + + # Create full results + full_results = { + 'summary': summary, + 'problems': results, + 'successful_indices': successful_indices, # Technical successes + 'failed_indices': failed_indices, # Technical failures + 'correct_indices': correct_indices, # Correct answers + 'timestamp': datetime.now().isoformat() + } + + # Save if requested + if output_file: + logger.info(f"Saving results to {output_file}") + with open(output_file, 'w', encoding='utf-8') as f: + json.dump(full_results, f, indent=2, ensure_ascii=False) + + # Print summary + logger.info(f"\n{'='*60}") + logger.info("CROSS-PROVIDER EVALUATION SUMMARY") + logger.info(f"{'='*60}") + logger.info(f"Solver Provider: {summary['solver_provider']} ({loader.solver_model})") + logger.info(f"Grader Provider: {summary['grader_provider']} ({loader.grader_model})") + logger.info(f"Variant: {variant_type}") + logger.info(f"Total problems: {summary['total_problems']}") + logger.info(f"✅ Successfully processed: {summary['successful']} ({summary['success_rate']:.1f}%)") + logger.info(f"💥 Technical failures: {summary['failed']}") + logger.info(f"🎯 Correct answers: {summary['correct_answers']} ({summary['accuracy_rate']:.1f}% of processed)") + logger.info(f"❌ Wrong answers: {summary['incorrect_answers']}") + logger.info(f"Average grade: {summary['average_grade']:.2f}") + logger.info(f"Total time: {summary['total_time_seconds']:.1f}s") + logger.info(f"Speed: {summary['problems_per_second']:.2f} problems/second") + + # Cleanup + if hasattr(loader, '__aexit__'): + await loader.__aexit__(None, None, None) + + return full_results + + +async def batch_evaluate_all_variants(dataset_path: Path, provider: str, + variants: List[str] = None, + max_concurrent: int = 3, max_files: int = None, + solver_model: str = None, grader_model: str = None, + output_dir: Path = None, + base_url: str = None, device: str = None) -> Dict[str, Any]: + """ + Batch evaluate problems across all variants using 
specified provider. + + Args: + dataset_path: Path to dataset directory + provider: AI provider name + variants: List of variants to test (None for all) + max_concurrent: Maximum concurrent evaluations + max_files: Maximum number of files to process per variant (None for all) + solver_model: Override solver model + grader_model: Override grader model + output_dir: Output directory path + **loader_kwargs: Additional arguments for loader + + Returns: + Dictionary with all variant results and comparative analysis + """ + if variants is None: + variants = ["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"] + + if output_dir is None: + output_dir = Path("results") + + logger = logging.getLogger(__name__) + + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + config_name = f"{provider}" + if solver_model: + config_name += f"_{solver_model.replace('/', '_').replace('-', '_')}" + + # Create configuration-specific output directory + config_output_dir = output_dir / f"{config_name}_{timestamp}" + config_output_dir.mkdir(parents=True, exist_ok=True) + + # Prepare loader kwargs based on provider + loader_kwargs = {} + if provider == 'vllm' and base_url: + loader_kwargs['base_url'] = base_url + elif provider == 'huggingface' and device: + loader_kwargs['device'] = device + + logger.info(f"🚀 Starting multi-variant test for {config_name}") + logger.info(f"📊 Testing {len(variants)} variants with up to {max_files or 'ALL'} files each") + + overall_start_time = time.time() + variant_results = {} + + # Create overall progress bar for variants if tqdm is available + if HAS_TQDM: + variant_progress = tqdm.tqdm(total=len(variants), desc="Variants", + unit="variant", position=1, leave=True) + + for i, variant in enumerate(variants): + logger.info(f"\n📝 [{i+1}/{len(variants)}] Testing variant: {variant}") + variant_start_time = time.time() + + # Output file for this variant + variant_output_file = 
config_output_dir / f"{variant}_{timestamp}.json" + + try: + # Run batch evaluation for this variant + result = await batch_evaluate( + dataset_path=dataset_path, + provider=provider, + variant_type=variant, + max_concurrent=max_concurrent, + max_files=max_files, + solver_model=solver_model, + grader_model=grader_model, + output_file=variant_output_file, + **loader_kwargs + ) + + variant_time = time.time() - variant_start_time + + # Extract key metrics + summary = result.get('summary', {}) + variant_results[variant] = { + 'status': 'success', + 'output_file': str(variant_output_file), + 'total_problems': summary.get('total_problems', 0), + 'successful_evaluations': summary.get('successful', 0), + 'correct_evaluations': summary.get('correct_answers', 0), + 'incorrect_evaluations': summary.get('incorrect_answers', 0), + 'failed_evaluations': summary.get('failed', 0), + 'success_rate': summary.get('success_rate', 0), + 'average_grade': summary.get('average_grade', 0), + 'total_processing_time': summary.get('total_time_seconds', 0), + 'avg_time_per_problem': summary.get('problems_per_second', 0), + 'variant_test_time': variant_time, + 'grade_distribution': result.get('problems', []) # Assuming 'problems' contains all results + } + + logger.info(f"✅ {variant}: " + f"Grade {summary.get('average_grade', 0):.2f}, " + f"Success {summary.get('success_rate', 0):.1f}%, " + f"Time {variant_time/60:.1f}min") + + except Exception as e: + variant_time = time.time() - variant_start_time + error_msg = str(e) + + variant_results[variant] = { + 'status': 'failed', + 'error': error_msg, + 'variant_test_time': variant_time + } + + logger.error(f"❌ {variant} failed: {error_msg}") + + # Update variant progress bar + if HAS_TQDM and 'variant_progress' in locals(): + variant_progress.update(1) + successful_variants_count = len([v for v, r in variant_results.items() if r.get('status') == 'success']) + variant_progress.set_postfix({ + 'Success': successful_variants_count, + 'Failed': 
len(variant_results) - successful_variants_count + }) + + # Close variant progress bar + if HAS_TQDM and 'variant_progress' in locals(): + variant_progress.close() + + overall_time = time.time() - overall_start_time + + # Generate comprehensive summary + successful_variants = [v for v, r in variant_results.items() if r.get('status') == 'success'] + failed_variants = [v for v, r in variant_results.items() if r.get('status') == 'failed'] + + # Calculate aggregate statistics + if successful_variants: + total_problems = sum(variant_results[v].get('total_problems', 0) for v in successful_variants) + total_successful = sum(variant_results[v].get('successful_evaluations', 0) for v in successful_variants) + total_correct = sum(variant_results[v].get('correct_evaluations', 0) for v in successful_variants) + total_incorrect = sum(variant_results[v].get('incorrect_evaluations', 0) for v in successful_variants) + total_failed = sum(variant_results[v].get('failed_evaluations', 0) for v in successful_variants) + + grades = [variant_results[v].get('average_grade', 0) for v in successful_variants] + success_rates = [variant_results[v].get('success_rate', 0) for v in successful_variants] + times = [variant_results[v].get('avg_time_per_problem', 0) for v in successful_variants] + + overall_avg_grade = sum(grades) / len(grades) if grades else 0 + overall_success_rate = sum(success_rates) / len(success_rates) if success_rates else 0 + overall_avg_time = sum(times) / len(times) if times else 0 + + # Find best and worst performing variants + best_variant = max(successful_variants, key=lambda v: variant_results[v].get('average_grade', 0)) + worst_variant = min(successful_variants, key=lambda v: variant_results[v].get('average_grade', 0)) + + fastest_variant = min(successful_variants, key=lambda v: variant_results[v].get('avg_time_per_problem', float('inf'))) + slowest_variant = max(successful_variants, key=lambda v: variant_results[v].get('avg_time_per_problem', 0)) + else: + 
total_problems = total_successful = total_correct = total_incorrect = total_failed = 0 + overall_avg_grade = overall_success_rate = overall_avg_time = 0 + best_variant = worst_variant = fastest_variant = slowest_variant = None + + summary_result = { + 'configuration': { + 'provider': provider, + 'solver_model': solver_model, + 'grader_model': grader_model, + 'base_url': base_url, + 'device': device, + 'timestamp': timestamp + }, + 'test_overview': { + 'total_variants_tested': len(variant_results), + 'successful_variants': len(successful_variants), + 'failed_variants': len(failed_variants), + 'total_test_time_minutes': overall_time / 60, + 'variants_list': list(variant_results.keys()) + }, + 'aggregate_metrics': { + 'total_problems_across_variants': total_problems, + 'total_successful_evaluations': total_successful, + 'total_correct_evaluations': total_correct, + 'total_incorrect_evaluations': total_incorrect, + 'total_technical_failures': total_failed, + 'overall_average_grade': overall_avg_grade, + 'overall_success_rate': overall_success_rate, + 'overall_avg_time_per_problem': overall_avg_time + }, + 'variant_comparison': { + 'best_performing_variant': { + 'variant': best_variant, + 'grade': variant_results.get(best_variant, {}).get('average_grade', 0) if best_variant else 0 + }, + 'worst_performing_variant': { + 'variant': worst_variant, + 'grade': variant_results.get(worst_variant, {}).get('average_grade', 0) if worst_variant else 0 + }, + 'fastest_variant': { + 'variant': fastest_variant, + 'time_per_problem': variant_results.get(fastest_variant, {}).get('avg_time_per_problem', 0) if fastest_variant else 0 + }, + 'slowest_variant': { + 'variant': slowest_variant, + 'time_per_problem': variant_results.get(slowest_variant, {}).get('avg_time_per_problem', 0) if slowest_variant else 0 + } + }, + 'detailed_variant_results': variant_results + } + + # Save configuration summary + summary_file = config_output_dir / f"SUMMARY_{config_name}_{timestamp}.json" + with 
open(summary_file, 'w', encoding='utf-8') as f: + json.dump(summary_result, f, indent=2, ensure_ascii=False) + + # Print summary to console + logger.info("\n" + "="*80) + logger.info("📊 MULTI-VARIANT TEST SUMMARY REPORT") + logger.info("="*80) + + logger.info(f"🤖 Provider: {provider}") + if solver_model: + logger.info(f"🧠 Solver Model: {solver_model}") + if grader_model: + logger.info(f"📝 Grader Model: {grader_model}") + + logger.info(f"\n📋 Test Overview:") + logger.info(f" Total variants tested: {len(variant_results)}") + logger.info(f" Successful variants: {len(successful_variants)}") + logger.info(f" Failed variants: {len(failed_variants)}") + logger.info(f" Total test time: {overall_time/60:.1f} minutes") + + if total_problems > 0: + logger.info(f"\n📈 Aggregate Performance:") + logger.info(f" Total problems: {total_problems}") + logger.info(f" Overall average grade: {overall_avg_grade:.2f}") + logger.info(f" Overall success rate: {overall_success_rate:.1f}%") + logger.info(f" Average time per problem: {overall_avg_time:.2f}s") + + if best_variant: + logger.info(f"\n🏆 Variant Performance:") + logger.info(f" Best performing: {best_variant} (Grade: {variant_results[best_variant]['average_grade']:.2f})") + logger.info(f" Worst performing: {worst_variant} (Grade: {variant_results[worst_variant]['average_grade']:.2f})") + logger.info(f" Fastest: {fastest_variant} ({variant_results[fastest_variant]['avg_time_per_problem']:.2f}s/problem)") + logger.info(f" Slowest: {slowest_variant} ({variant_results[slowest_variant]['avg_time_per_problem']:.2f}s/problem)") + + logger.info("="*80) + logger.info(f"💾 Configuration summary saved to {summary_file}") + + return summary_result + + +async def main(): + """Main function.""" + parser = argparse.ArgumentParser(description="Batch evaluate mathematical problems") + + # Required arguments + parser.add_argument("--provider", required=True, choices=get_supported_providers(), + help="AI provider to use") + + # Dataset options + 
parser.add_argument("--dataset", default="dataset", + help="Dataset directory path (default: dataset)") + parser.add_argument("--variant", default="original", + choices=["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Problem variant to use (default: original)") + parser.add_argument("--all-variants", action="store_true", + help="Test all 6 problem variants instead of just one") + parser.add_argument("--variants", nargs="+", + choices=["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Specific variants to test (use with --all-variants)") + parser.add_argument("--max-files", type=int, + help="Maximum number of files to process per variant (default: all)") + + # Processing options + parser.add_argument("--max-concurrent", type=int, default=3, + help="Maximum concurrent evaluations (default: 3)") + parser.add_argument("--solver-model", + help="Override solver model") + parser.add_argument("--grader-model", + help="Override grader model") + + # Output options + parser.add_argument("--output", type=Path, + help="Output file path (default: results/[provider]_[timestamp].json)") + parser.add_argument("--output-dir", type=Path, default="results", + help="Output directory (default: results)") + parser.add_argument("--resume", type=Path, + help="Path to checkpoint file to resume from") + + # Provider-specific options + parser.add_argument("--base-url", + help="Base URL for VLLM provider") + parser.add_argument("--device", default="auto", + help="Device for HuggingFace provider (auto/cuda/cpu)") + + args = parser.parse_args() + + # Setup output directory and logging + args.output_dir.mkdir(parents=True, exist_ok=True) + logger = setup_logging(args.output_dir) + + # Default output file if not specified + if not args.output: + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + args.output = 
args.output_dir / f"{args.provider}_{args.variant}_{timestamp}.json" + + # Prepare loader kwargs based on provider + loader_kwargs = {} + if args.provider == 'vllm' and args.base_url: + loader_kwargs['base_url'] = args.base_url + elif args.provider == 'huggingface' and args.device: + loader_kwargs['device'] = args.device + + try: + if args.all_variants or args.variants: + # Multi-variant evaluation + variants_to_test = args.variants if args.variants else None + results = await batch_evaluate_all_variants( + dataset_path=Path(args.dataset), + provider=args.provider, + variants=variants_to_test, + max_concurrent=args.max_concurrent, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_dir=args.output_dir, + base_url=args.base_url, + device=args.device + ) + + logger.info(f"Multi-variant evaluation completed successfully!") + logger.info(f"Overall average grade: {results['aggregate_metrics']['overall_average_grade']:.2f}") + logger.info(f"Overall success rate: {results['aggregate_metrics']['overall_success_rate']:.1f}%") + else: + # Single variant evaluation + results = await batch_evaluate( + dataset_path=Path(args.dataset), + provider=args.provider, + variant_type=args.variant, + max_concurrent=args.max_concurrent, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=args.output, + resume_checkpoint=args.resume, + **loader_kwargs + ) + + logger.info(f"Batch evaluation completed successfully!") + logger.info(f"Average grade: {results['summary']['average_grade']:.2f}") + logger.info(f"Success rate: {results['summary']['success_rate']:.1f}%") + + except KeyboardInterrupt: + logger.info("Evaluation interrupted by user") + except Exception as e: + logger.error(f"Evaluation failed: {str(e)}") + return 1 + + return 0 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
#!/usr/bin/env python3
"""
Benchmark script for comparing AI providers and models on mathematical problems.

This script runs comparative evaluations across multiple providers, models, and
problem variants to assess performance, accuracy, cost, and speed trade-offs.

Usage:
    python benchmark.py --config benchmark_config.json
    python benchmark.py --quick-test  # Quick 3-problem test across all providers
    python benchmark.py --providers openai anthropic --models gpt-4o-mini claude-3-5-haiku
"""

import asyncio
import json
import sys
import time
from pathlib import Path
import argparse
from typing import List, Dict, Any, Tuple
import logging
from datetime import datetime
import itertools
import statistics

# Add the loader module to the path
sys.path.append(str(Path(__file__).parent))

from loader import create_loader, get_supported_providers, get_default_models


class BenchmarkRunner:
    """Benchmark runner for AI providers.

    Runs one or more provider/model configurations over a shared set of
    problems and produces per-configuration metrics plus a cross-configuration
    comparison report.
    """

    def __init__(self, output_dir: Path = Path("benchmark_results")):
        """Create the runner and configure file + console logging.

        Args:
            output_dir: Directory for logs and result files; created if missing.
        """
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)

        # Log to both a timestamped file and stdout.
        log_file = self.output_dir / f"benchmark_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(log_file),
                logging.StreamHandler(sys.stdout)
            ]
        )
        self.logger = logging.getLogger(__name__)

    async def load_test_problems(self, dataset_path: Path, max_problems: int = 10) -> List[Dict[str, Any]]:
        """Load up to ``max_problems`` problem dicts from ``*.json`` files.

        Each loaded dict is tagged with its origin under ``_source_file``.
        Files that fail to parse are logged and skipped, never fatal.
        (Declared ``async`` for interface consistency with the rest of the
        pipeline even though the body does no awaiting.)
        """
        json_files = list(dataset_path.glob("*.json"))[:max_problems]

        problems = []
        for json_file in json_files:
            try:
                with open(json_file, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                    data['_source_file'] = str(json_file.name)
                    problems.append(data)
            except Exception as e:
                self.logger.warning(f"Failed to load {json_file}: {str(e)}")

        return problems

    async def run_single_configuration(self,
                                       provider: str,
                                       solver_model: str,
                                       grader_model: str,
                                       problems: List[Dict[str, Any]],
                                       variant_type: str = "original",
                                       **loader_kwargs) -> Dict[str, Any]:
        """Run the benchmark for a single provider/model configuration.

        Returns a dict with the configuration, aggregate ``metrics``,
        per-problem records, and any per-problem ``errors``. A fatal setup
        failure (loader creation, health check) is captured under
        ``metrics['fatal_error']`` rather than raised.
        """
        config_name = f"{provider}_{solver_model}_{grader_model}".replace("/", "_").replace("-", "_")
        self.logger.info(f"🚀 Testing configuration: {config_name}")

        result = {
            'configuration': {
                'provider': provider,
                'solver_model': solver_model,
                'grader_model': grader_model,
                'variant_type': variant_type,
                'loader_kwargs': loader_kwargs
            },
            'metrics': {},
            'problems': [],
            'errors': []
        }

        try:
            # Create loader for this provider/model pair.
            loader = create_loader(
                provider,
                solver_model=solver_model,
                grader_model=grader_model,
                **loader_kwargs
            )

            # Fail fast if the provider is unreachable / misconfigured.
            if not await loader.health_check():
                raise RuntimeError(f"Health check failed for {provider}")

            # Record the provider's own cost estimate for this batch size.
            cost_info = await loader.estimate_cost(len(problems))
            result['metrics']['estimated_cost'] = cost_info

            # Process each problem sequentially, timing each one.
            start_time = time.time()
            grades = []
            processing_times = []

            for i, problem in enumerate(problems):
                problem_start = time.time()

                try:
                    problem_result = await loader.test_single_problem(
                        problem,
                        variant_type=variant_type
                    )

                    processing_time = time.time() - problem_start
                    # Convert boolean 'correct' to numeric grade (10 for correct, 0 for incorrect)
                    grade = 10 if problem_result.get('correct', False) else 0

                    grades.append(grade)
                    processing_times.append(processing_time)

                    result['problems'].append({
                        'source_file': problem.get('_source_file', f'problem_{i}'),
                        'grade': grade,
                        'processing_time': processing_time,
                        'solution_length': len(problem_result.get('solution', '')),
                        'grading_feedback_length': len(str(problem_result.get('grading_result', {}).get('feedback', '')))
                    })

                    self.logger.info(f"  Problem {i+1}/{len(problems)}: Grade {grade} ({processing_time:.2f}s)")

                except Exception as e:
                    # Per-problem failures are recorded, not fatal to the run.
                    error_info = {
                        'problem_index': i,
                        'source_file': problem.get('_source_file', f'problem_{i}'),
                        'error': str(e),
                        'processing_time': time.time() - problem_start
                    }
                    result['errors'].append(error_info)
                    self.logger.error(f"  Problem {i+1}/{len(problems)} failed: {str(e)}")

            total_time = time.time() - start_time

            # Aggregate metrics over the successful problems.
            if grades:
                result['metrics'].update({
                    'total_problems': len(problems),
                    'successful_problems': len(grades),
                    'failed_problems': len(result['errors']),
                    'success_rate': len(grades) / len(problems) * 100,
                    'average_grade': statistics.mean(grades),
                    'median_grade': statistics.median(grades),
                    'grade_std': statistics.stdev(grades) if len(grades) > 1 else 0,
                    'max_grade': max(grades),
                    'min_grade': min(grades),
                    'total_time': total_time,
                    'average_time_per_problem': statistics.mean(processing_times),
                    'median_time_per_problem': statistics.median(processing_times),
                    'total_time_successful': sum(processing_times),
                    'throughput_problems_per_minute': len(grades) / (total_time / 60) if total_time > 0 else 0
                })
            else:
                result['metrics'].update({
                    'total_problems': len(problems),
                    'successful_problems': 0,
                    'failed_problems': len(result['errors']),
                    'success_rate': 0,
                    'total_time': total_time,
                    'error_rate': 100
                })

            self.logger.info(f"✅ Configuration completed: {result['metrics']['success_rate']:.1f}% success, "
                             f"avg grade: {result['metrics'].get('average_grade', 0):.2f}")

        except Exception as e:
            result['metrics']['fatal_error'] = str(e)
            self.logger.error(f"❌ Configuration failed: {str(e)}")

        return result

    async def run_comparative_benchmark(self,
                                        configurations: List[Dict[str, Any]],
                                        problems: List[Dict[str, Any]],
                                        variant_type: str = "original") -> Dict[str, Any]:
        """Run comparative benchmark across multiple configurations.

        Each configuration dict needs a ``provider`` and may carry
        ``solver_model``, ``grader_model``, and ``loader_kwargs``. Saves a
        detailed JSON dump and returns the comparison report.
        """
        self.logger.info(f"🏁 Starting comparative benchmark with {len(configurations)} configurations")
        self.logger.info(f"📊 Testing {len(problems)} problems with variant: {variant_type}")

        benchmark_start = time.time()
        results = []

        for i, config in enumerate(configurations):
            self.logger.info(f"\n📋 Configuration {i+1}/{len(configurations)}")

            provider = config['provider']
            # Copy so popping below never mutates the caller's config dict.
            loader_kwargs = dict(config.get('loader_kwargs', {}))
            # BUG FIX: model names may also arrive inside loader_kwargs (the
            # quick-test HuggingFace configuration does this). Pop them out so
            # create_loader() is not called with duplicate keyword arguments,
            # which previously raised TypeError.
            solver_model = config.get('solver_model') or loader_kwargs.pop('solver_model', None)
            grader_model = config.get('grader_model') or loader_kwargs.pop('grader_model', None)

            # Use provider defaults if still unspecified.
            if not solver_model or not grader_model:
                defaults = get_default_models(provider)
                solver_model = solver_model or defaults['solver_model']
                grader_model = grader_model or defaults['grader_model']

            config_result = await self.run_single_configuration(
                provider=provider,
                solver_model=solver_model,
                grader_model=grader_model,
                problems=problems,
                variant_type=variant_type,
                **loader_kwargs
            )

            results.append(config_result)

        total_benchmark_time = time.time() - benchmark_start

        # Generate comparison report
        report = self.generate_comparison_report(results, total_benchmark_time)

        # Save detailed results alongside the inputs that produced them.
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        detailed_file = self.output_dir / f"benchmark_detailed_{timestamp}.json"
        with open(detailed_file, 'w', encoding='utf-8') as f:
            json.dump({
                'benchmark_info': {
                    'timestamp': datetime.now().isoformat(),
                    'total_configurations': len(configurations),
                    'total_problems': len(problems),
                    'variant_type': variant_type,
                    'total_time': total_benchmark_time
                },
                'configurations': configurations,
                'results': results,
                'comparison_report': report
            }, f, indent=2, ensure_ascii=False)

        self.logger.info(f"💾 Detailed results saved to {detailed_file}")

        return report

    def generate_comparison_report(self, results: List[Dict[str, Any]], total_time: float) -> Dict[str, Any]:
        """Generate comparison report from benchmark results.

        Logs human-readable rankings (accuracy, speed, throughput, success
        rate, cost efficiency) and returns a machine-readable summary dict.
        Configurations with a zero success rate are excluded from rankings.
        """
        self.logger.info("\n" + "="*60)
        self.logger.info("📊 BENCHMARK COMPARISON REPORT")
        self.logger.info("="*60)

        # Filter successful results
        successful_results = [r for r in results if r['metrics'].get('success_rate', 0) > 0]

        if not successful_results:
            self.logger.warning("⚠️ No successful configurations found!")
            return {'error': 'No successful configurations'}

        # Ranking by different metrics
        rankings = {
            'accuracy': sorted(successful_results, key=lambda x: x['metrics']['average_grade'], reverse=True),
            'speed': sorted(successful_results, key=lambda x: x['metrics']['average_time_per_problem']),
            'throughput': sorted(successful_results, key=lambda x: x['metrics']['throughput_problems_per_minute'], reverse=True),
            'success_rate': sorted(successful_results, key=lambda x: x['metrics']['success_rate'], reverse=True)
        }

        # Print rankings
        for metric, ranked_results in rankings.items():
            self.logger.info(f"\n🏆 Top 3 by {metric.upper()}:")
            for i, result in enumerate(ranked_results[:3]):
                config = result['configuration']
                metrics = result['metrics']
                provider = config['provider']
                solver = config['solver_model']

                if metric == 'accuracy':
                    value = f"{metrics['average_grade']:.2f}"
                elif metric == 'speed':
                    value = f"{metrics['average_time_per_problem']:.2f}s"
                elif metric == 'throughput':
                    value = f"{metrics['throughput_problems_per_minute']:.1f} prob/min"
                elif metric == 'success_rate':
                    value = f"{metrics['success_rate']:.1f}%"

                self.logger.info(f"  {i+1}. {provider}/{solver}: {value}")

        # Calculate cost efficiency (grade achieved per unit of estimated cost).
        cost_efficiency = []
        for result in successful_results:
            metrics = result['metrics']
            cost_info = metrics.get('estimated_cost', {})
            total_cost = cost_info.get('total_cost', 0)
            avg_grade = metrics.get('average_grade', 0)

            if total_cost > 0 and avg_grade > 0:
                efficiency = avg_grade / total_cost  # Grade per unit cost
                cost_efficiency.append({
                    'result': result,
                    'efficiency': efficiency,
                    'cost': total_cost,
                    'grade': avg_grade
                })

        if cost_efficiency:
            cost_efficiency.sort(key=lambda x: x['efficiency'], reverse=True)
            self.logger.info(f"\n💰 Top 3 by COST EFFICIENCY (Grade/Cost):")
            for i, item in enumerate(cost_efficiency[:3]):
                config = item['result']['configuration']
                provider = config['provider']
                solver = config['solver_model']
                self.logger.info(f"  {i+1}. {provider}/{solver}: {item['efficiency']:.2f} "
                                 f"(Grade: {item['grade']:.2f}, Cost: {item['cost']:.4f})")

        # Overall statistics
        all_grades = []
        all_times = []
        all_success_rates = []

        for result in successful_results:
            metrics = result['metrics']
            all_grades.append(metrics['average_grade'])
            all_times.append(metrics['average_time_per_problem'])
            all_success_rates.append(metrics['success_rate'])

        self.logger.info(f"\n📈 OVERALL STATISTICS:")
        self.logger.info(f"  Configurations tested: {len(results)}")
        self.logger.info(f"  Successful configurations: {len(successful_results)}")
        self.logger.info(f"  Average grade across all: {statistics.mean(all_grades):.2f}")
        self.logger.info(f"  Average time per problem: {statistics.mean(all_times):.2f}s")
        self.logger.info(f"  Average success rate: {statistics.mean(all_success_rates):.1f}%")
        self.logger.info(f"  Total benchmark time: {total_time/60:.2f} minutes")

        # Which metrics key scores each ranking (kept as an explicit mapping
        # instead of the previous single-element-list comprehension trick).
        metric_keys = {
            'accuracy': 'average_grade',
            'speed': 'average_time_per_problem',
            'throughput': 'throughput_problems_per_minute',
            'success_rate': 'success_rate',
        }

        # Generate final report
        report = {
            'summary': {
                'total_configurations': len(results),
                'successful_configurations': len(successful_results),
                'overall_avg_grade': statistics.mean(all_grades) if all_grades else 0,
                'overall_avg_time': statistics.mean(all_times) if all_times else 0,
                'overall_avg_success_rate': statistics.mean(all_success_rates) if all_success_rates else 0,
                'total_benchmark_time': total_time
            },
            'rankings': {
                metric: [
                    {
                        'provider': r['configuration']['provider'],
                        'solver_model': r['configuration']['solver_model'],
                        'grader_model': r['configuration']['grader_model'],
                        'score': r['metrics'][metric_keys[metric]]
                    }
                    for r in ranked[:5]  # Top 5
                ] for metric, ranked in rankings.items()
            },
            'cost_efficiency': [
                {
                    'provider': item['result']['configuration']['provider'],
                    'solver_model': item['result']['configuration']['solver_model'],
                    'efficiency': item['efficiency'],
                    'grade': item['grade'],
                    'cost': item['cost']
                }
                for item in cost_efficiency[:5]
            ] if cost_efficiency else []
        }

        return report


async def run_quick_test():
    """Run a quick test across all providers with 3 problems."""
    runner = BenchmarkRunner()

    # Load 3 test problems
    problems = await runner.load_test_problems(Path("dataset"), max_problems=3)
    if not problems:
        print("❌ No test problems found in dataset directory")
        return

    # Default configurations for all providers
    configurations = []
    for provider in get_supported_providers():
        config = {'provider': provider}

        # Provider-specific settings
        if provider == 'vllm':
            config['loader_kwargs'] = {'base_url': 'http://localhost:8000/v1'}
        elif provider == 'huggingface':
            # NOTE: model names inside loader_kwargs are extracted by
            # run_comparative_benchmark before loader construction.
            config['loader_kwargs'] = {
                'device': 'cpu',
                'solver_model': 'microsoft/DialoGPT-small',
                'grader_model': 'microsoft/DialoGPT-small'
            }

        configurations.append(config)

    # Run benchmark
    await runner.run_comparative_benchmark(configurations, problems)


async def run_custom_benchmark(config_file: Path):
    """Run benchmark from configuration file.

    The JSON file may define ``output_dir``, ``dataset_path``,
    ``max_problems``, ``variant_type``, and a ``configurations`` list.
    """
    with open(config_file, 'r', encoding='utf-8') as f:
        config = json.load(f)

    runner = BenchmarkRunner(Path(config.get('output_dir', 'benchmark_results')))

    # Load problems
    dataset_path = Path(config.get('dataset_path', 'dataset'))
    max_problems = config.get('max_problems', 10)
    variant_type = config.get('variant_type', 'original')

    problems = await runner.load_test_problems(dataset_path, max_problems)
    if not problems:
        print(f"❌ No problems found in {dataset_path}")
        return

    # Load configurations
    configurations = config.get('configurations', [])
    if not configurations:
        print("❌ No configurations specified in config file")
        return

    # Run benchmark
    await runner.run_comparative_benchmark(configurations, problems, variant_type)


async def main():
    """Parse CLI arguments and dispatch to the selected benchmark mode.

    Returns a process exit code (0 on success, 1 on failure/interrupt).
    """
    parser = argparse.ArgumentParser(description="Benchmark AI providers on mathematical problems")

    # Benchmark modes
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--config", type=Path, help="Configuration file path")
    group.add_argument("--quick-test", action="store_true",
                       help="Quick test with 3 problems across all providers")

    # Custom benchmark options (reserved: the custom mode below is currently
    # unreachable because the mode group is required).
    parser.add_argument("--providers", nargs="+", choices=get_supported_providers(),
                        help="Providers to test (for custom benchmark)")
    parser.add_argument("--models", nargs="+",
                        help="Models to test (for custom benchmark)")
    parser.add_argument("--dataset", type=Path, default="dataset",
                        help="Dataset path (default: dataset)")
    parser.add_argument("--max-problems", type=int, default=10,
                        help="Maximum problems to test (default: 10)")
    parser.add_argument("--variant", default="original",
                        choices=["original", "descriptive_long", "kernel_variant"],
                        help="Problem variant (default: original)")
    parser.add_argument("--output-dir", type=Path, default="benchmark_results",
                        help="Output directory (default: benchmark_results)")

    args = parser.parse_args()

    try:
        if args.quick_test:
            await run_quick_test()
        elif args.config:
            await run_custom_benchmark(args.config)
        else:
            # Custom benchmark mode (placeholder for future implementation)
            print("Custom benchmark mode not yet implemented. Use --config or --quick-test.")
            return 1

        return 0

    except KeyboardInterrupt:
        print("\n⏸️ Benchmark interrupted by user")
        return 1
    except Exception as e:
        print(f"\n❌ Benchmark failed: {str(e)}")
        return 1


if __name__ == "__main__":
    # sys.exit (not the site builtin exit()) is the correct script exit.
    sys.exit(asyncio.run(main()))
#!/usr/bin/env python3
"""
Original vs. kernel-variant math-ability comparison test.

gpt-4o-mini solves both the original problem and its kernel variant, o3 grades
the answers (strictly for proofs, leniently for calculations), and the pass
rates of the two problem forms are compared.
"""

import os
import json
import asyncio
import pathlib
import time
import re
import random
from typing import Dict, List, Tuple, Optional
import click
import tqdm
from openai import AsyncOpenAI, RateLimitError, APIError, APIConnectionError

# Configuration
SOLVER_MODEL = "gpt-4o-mini"   # model used to solve problems
GRADER_MODEL = "o3"            # model used to grade solutions
SRC_DIR = pathlib.Path("raw/json")
RESULTS_DIR = pathlib.Path("results/comparison_test")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

RETRIES = 4
TIMEOUT_BASE = 600             # seconds; doubled on each retry attempt
RESP_FMT = {"type": "json_object"}

# Solver system prompt (gpt-4o-mini)
SOLVER_SYSTEM_PROMPT = """You are an expert mathematician solving competition-level problems.
Provide detailed, step-by-step solutions with clear mathematical reasoning.

Requirements:
- Show all your work and intermediate steps
- Justify each major step of your reasoning
- Use proper mathematical notation
- Be thorough but concise
- State your final answer clearly

Solve the problem completely and rigorously."""

SOLVER_USER_TEMPLATE = """Please solve this mathematical problem:

{problem_statement}

Provide a complete solution with detailed reasoning. Return your response in JSON format:
{{"solution": "your complete step-by-step solution with mathematical reasoning",
  "final_answer": "your final answer in a clear, concise form"}}"""

# Strict grading system prompt for proof problems (o3)
PROOF_GRADER_SYSTEM_PROMPT = """You are an extremely strict mathematical grader evaluating competition-level PROOF problems.

GRADING STANDARDS (BE VERY STRICT):
- Mathematical rigor: Every step must be mathematically sound and justified
- Logical flow: The reasoning must be clear, complete, and logically connected
- Correctness: All calculations, algebraic manipulations, and conclusions must be correct
- Completeness: The solution must address all parts of the problem fully
- Precision: Mathematical statements must be precise and unambiguous

FAILING CRITERIA (Mark as INCORRECT if ANY of these apply):
- Any unjustified logical leap or gap in reasoning
- Any computational error, no matter how small
- Missing steps in critical parts of the argument
- Imprecise or ambiguous mathematical statements
- Incorrect final answer, even if approach is partially correct
- Circular reasoning or logical fallacies
- Misuse of mathematical theorems or definitions

BE EXTREMELY STRICT. Competition mathematics proofs require perfect precision."""

# More lenient grading system prompt for calculation problems (o3)
CALCULATION_GRADER_SYSTEM_PROMPT = """You are a mathematical grader evaluating competition-level CALCULATION problems.

GRADING STANDARDS FOR CALCULATION PROBLEMS:
- Primary focus: Is the final answer correct?
- Secondary focus: Is the overall approach reasonable and mathematically sound?
- Computation: Allow minor computational slips if the method is correct and final answer is right

GRADING CRITERIA:
- CORRECT: Final answer is correct AND approach is fundamentally sound
- INCORRECT: Final answer is wrong OR approach is fundamentally flawed

For calculation problems, the final numerical answer is the most important criterion.
Minor intermediate errors are acceptable if they don't affect the final result."""

PROOF_GRADER_USER_TEMPLATE = """Grade this PROOF solution with extreme strictness.

PROBLEM:
{problem_statement}

STUDENT SOLUTION:
{solution}

CORRECT REFERENCE SOLUTION:
{reference_solution}

Evaluate with maximum strictness. Every logical step must be perfect. Return JSON with:
{{"grade": "CORRECT" or "INCORRECT",
  "detailed_feedback": "specific detailed analysis of what is right/wrong",
  "major_issues": "list of significant mathematical errors or gaps",
  "final_answer_correct": true or false,
  "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed),
  "overall_assessment": "comprehensive evaluation summary"}}"""

CALCULATION_GRADER_USER_TEMPLATE = """Grade this CALCULATION solution with focus on final answer correctness.

PROBLEM:
{problem_statement}

STUDENT SOLUTION:
{solution}

CORRECT REFERENCE SOLUTION:
{reference_solution}

Focus primarily on whether the final answer is correct. Return JSON with:
{{"grade": "CORRECT" or "INCORRECT",
  "detailed_feedback": "specific detailed analysis of what is right/wrong",
  "major_issues": "list of significant mathematical errors or gaps",
  "final_answer_correct": true or false,
  "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed),
  "overall_assessment": "comprehensive evaluation summary"}}"""

# Greedy match of the outermost {...} span in a free-form LLM reply.
JSON_RE = re.compile(r"\{[\s\S]*\}")

def parse_json_response(raw: str) -> Optional[Dict]:
    """Parse JSON from LLM response with fallback strategies.

    Tries, in order: the raw text as-is, the outermost ``{...}`` span, and a
    de-escaped copy of the text. Returns None if all attempts fail.
    """
    if not raw:
        return None

    # 1) The whole payload may already be valid JSON.
    try:
        return json.loads(raw)
    except (ValueError, TypeError):
        # Narrowed from a bare except: json.loads raises JSONDecodeError
        # (a ValueError) on bad content and TypeError on non-string input.
        pass

    # 2) Pull out the outermost {...} span and try that.
    match = JSON_RE.search(raw)
    if match:
        try:
            return json.loads(match.group(0))
        except ValueError:
            pass

    # 3) Last resort: undo common escaping damage and retry.
    try:
        fixed = raw.replace('\\"', '"').replace('\\\\', '\\')
        return json.loads(fixed)
    except (ValueError, TypeError):
        pass

    return None

def to_str(x) -> str:
    """Convert various types to string safely (lists are joined by newline)."""
    if x is None:
        return ""
    if isinstance(x, str):
        return x
    if isinstance(x, (list, tuple)):
        return "\n".join(map(str, x))
    return str(x)

async def call_api_with_retry(cli: AsyncOpenAI, model: str, messages: List[Dict]) -> Tuple[Optional[Dict], str]:
    """Make an OpenAI chat-completions call with retry and backoff.

    Returns (parsed_json, raw_text); parsed_json is None when all retries
    fail. The timeout doubles on each attempt; rate-limit errors back off
    exponentially (15 minutes when the quota is exhausted).
    """
    raw_response = ""

    for attempt in range(1, RETRIES + 1):
        timeout = TIMEOUT_BASE * (2 ** (attempt - 1))
        try:
            # Set temperature based on model:
            # o3, o3-mini, and o4-mini require temperature 1.0
            if any(model_name in model.lower() for model_name in ['o3', 'o3-mini', 'o4-mini']):
                temperature = 1.0
            else:
                # Use temperature 0.0 for deterministic solving with other models
                temperature = 0.0

            response = await asyncio.wait_for(
                cli.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=temperature,
                    response_format=RESP_FMT,
                ),
                timeout=timeout,
            )
            raw_response = response.choices[0].message.content or ""
            parsed = parse_json_response(raw_response)
            if parsed:
                return parsed, raw_response
            # Treat an unparseable reply like a transient failure so it
            # retries via the except branch below.
            raise ValueError("Failed to parse JSON response")

        except RateLimitError as e:
            print(f"🚫 RateLimitError (attempt {attempt}/{RETRIES}): {str(e)}")
            if "insufficient_quota" in str(e):
                print("⏳ Detected quota exhaustion - sleeping 15 minutes")
                await asyncio.sleep(900)
            else:
                sleep_time = 2 ** attempt + random.random()
                print(f"  ⏰ Rate limited, sleeping {sleep_time:.1f}s")
                await asyncio.sleep(sleep_time)

        except (APIError, APIConnectionError, asyncio.TimeoutError, ValueError) as e:
            print(f"❌ {type(e).__name__} (attempt {attempt}/{RETRIES}): {str(e)}")
            if attempt == RETRIES:
                return None, raw_response
            sleep_time = 2 ** attempt + random.random()
            print(f"  ⏰ Retrying in {sleep_time:.1f}s")
            await asyncio.sleep(sleep_time)

    return None, raw_response

async def solve_problem(cli: AsyncOpenAI, problem_statement: str) -> Tuple[Optional[Dict], str]:
    """Have the solver model (gpt-4o-mini) attempt the problem."""
    messages = [
        {"role": "system", "content": SOLVER_SYSTEM_PROMPT},
        {"role": "user", "content": SOLVER_USER_TEMPLATE.format(
            problem_statement=problem_statement
        )}
    ]
    return await call_api_with_retry(cli, SOLVER_MODEL, messages)

async def grade_solution(cli: AsyncOpenAI, problem_statement: str, solution: str,
                         reference_solution: str, problem_type: str = "proof") -> Tuple[Optional[Dict], str]:
    """Have the grader model (o3) grade the solution by problem type:
    strict for proofs, answer-focused (lenient) for calculations."""
    if problem_type == "calculation":
        system_prompt = CALCULATION_GRADER_SYSTEM_PROMPT
        user_template = CALCULATION_GRADER_USER_TEMPLATE
    else:  # Default to proof (strict grading)
        system_prompt = PROOF_GRADER_SYSTEM_PROMPT
        user_template = PROOF_GRADER_USER_TEMPLATE

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_template.format(
            problem_statement=problem_statement,
            solution=solution,
            reference_solution=reference_solution
        )}
    ]
    return await call_api_with_retry(cli, GRADER_MODEL, messages)

async def test_single_file(file_path: pathlib.Path, cli: AsyncOpenAI) -> Dict:
    """Test one problem file: solve and grade both the original problem and
    its kernel variant, then compare the outcomes.

    Returns a result dict whose ``status`` is one of ``completed``,
    ``skipped``, ``failed``, or ``error``.
    """
    try:
        # Load problem data
        data = json.loads(file_path.read_text(encoding='utf-8'))
        index = data.get("index", file_path.stem)

        # Check required fields
        original_question = to_str(data.get("question", "")).strip()
        original_solution = to_str(data.get("solution", "")).strip()
        problem_type = data.get("problem_type", "proof")  # default: proof → strict grading

        kv = data.get("variants", {}).get("kernel_variant")
        if not kv:
            return {
                "index": index,
                "status": "skipped",
                "reason": "no_kernel_variant"
            }

        kernel_question = to_str(kv.get("question", "")).strip()
        kernel_solution = to_str(kv.get("solution", "")).strip()

        if not all([original_question, original_solution, kernel_question, kernel_solution]):
            return {
                "index": index,
                "status": "skipped",
                "reason": "missing_fields"
            }

        print(f"🧮 Testing {index} (Type: {problem_type.upper()})")
        start_time = time.time()

        result = {
            "index": index,
            "status": "completed",
            "timestamp": time.time(),
            "problem_type": problem_type,
            "original": {},
            "kernel_variant": {},
            "comparison": {}
        }

        # 1. Solve the original problem with gpt-4o-mini
        print(f"  📝 Solving original problem...")
        orig_solve_result, orig_solve_raw = await solve_problem(cli, original_question)

        if not orig_solve_result:
            result["original"]["solve_status"] = "failed"
            result["status"] = "failed"
            return result

        orig_student_solution = to_str(orig_solve_result.get("solution", "")).strip()
        orig_final_answer = to_str(orig_solve_result.get("final_answer", "")).strip()

        result["original"]["student_solution"] = orig_student_solution
        result["original"]["student_final_answer"] = orig_final_answer
        result["original"]["solve_status"] = "success"

        # 2. Solve the kernel variant with gpt-4o-mini
        print(f"  📝 Solving kernel variant...")
        kv_solve_result, kv_solve_raw = await solve_problem(cli, kernel_question)

        if not kv_solve_result:
            result["kernel_variant"]["solve_status"] = "failed"
            result["status"] = "failed"
            return result

        kv_student_solution = to_str(kv_solve_result.get("solution", "")).strip()
        kv_final_answer = to_str(kv_solve_result.get("final_answer", "")).strip()

        result["kernel_variant"]["student_solution"] = kv_student_solution
        result["kernel_variant"]["student_final_answer"] = kv_final_answer
        result["kernel_variant"]["solve_status"] = "success"

        # 3. Grade the original solution with o3 (style depends on problem type)
        grading_style = "STRICT" if problem_type == "proof" else "LENIENT"
        print(f"  🔍 Grading original solution ({grading_style})...")
        orig_grade_result, orig_grade_raw = await grade_solution(
            cli, original_question, orig_student_solution, original_solution, problem_type
        )

        if not orig_grade_result:
            result["original"]["grade_status"] = "failed"
        else:
            result["original"]["grade_status"] = "success"
            result["original"]["grade"] = orig_grade_result.get("grade", "UNKNOWN")
            result["original"]["detailed_feedback"] = orig_grade_result.get("detailed_feedback", "")
            result["original"]["major_issues"] = orig_grade_result.get("major_issues", "")
            result["original"]["final_answer_correct"] = orig_grade_result.get("final_answer_correct", False)
            result["original"]["reasoning_rigor_score"] = orig_grade_result.get("reasoning_rigor_score", 0)
            result["original"]["overall_assessment"] = orig_grade_result.get("overall_assessment", "")

        # 4. Grade the kernel-variant solution with o3
        print(f"  🔍 Grading kernel variant solution ({grading_style})...")
        kv_grade_result, kv_grade_raw = await grade_solution(
            cli, kernel_question, kv_student_solution, kernel_solution, problem_type
        )

        if not kv_grade_result:
            result["kernel_variant"]["grade_status"] = "failed"
        else:
            result["kernel_variant"]["grade_status"] = "success"
            result["kernel_variant"]["grade"] = kv_grade_result.get("grade", "UNKNOWN")
            result["kernel_variant"]["detailed_feedback"] = kv_grade_result.get("detailed_feedback", "")
            result["kernel_variant"]["major_issues"] = kv_grade_result.get("major_issues", "")
            result["kernel_variant"]["final_answer_correct"] = kv_grade_result.get("final_answer_correct", False)
            result["kernel_variant"]["reasoning_rigor_score"] = kv_grade_result.get("reasoning_rigor_score", 0)
            result["kernel_variant"]["overall_assessment"] = kv_grade_result.get("overall_assessment", "")

        # 5. Comparative analysis (only when both gradings succeeded)
        if (result["original"]["grade_status"] == "success" and
            result["kernel_variant"]["grade_status"] == "success"):

            orig_correct = result["original"]["grade"] == "CORRECT"
            kv_correct = result["kernel_variant"]["grade"] == "CORRECT"

            result["comparison"]["original_correct"] = orig_correct
            result["comparison"]["kernel_variant_correct"] = kv_correct
            result["comparison"]["both_correct"] = orig_correct and kv_correct
            result["comparison"]["both_incorrect"] = not orig_correct and not kv_correct
            result["comparison"]["original_harder"] = not orig_correct and kv_correct  # original was harder
            result["comparison"]["kernel_variant_harder"] = orig_correct and not kv_correct  # kernel variant was harder

            orig_rigor = result["original"]["reasoning_rigor_score"]
            kv_rigor = result["kernel_variant"]["reasoning_rigor_score"]
            result["comparison"]["rigor_difference"] = orig_rigor - kv_rigor  # positive = original more rigorous

        total_time = time.time() - start_time
        result["processing_time"] = total_time

        print(f"  ✅ Completed {index} in {total_time:.1f}s")
        if result["comparison"]:
            orig_status = "✅" if result["comparison"]["original_correct"] else "❌"
            kv_status = "✅" if result["comparison"]["kernel_variant_correct"] else "❌"
            print(f"     Original: {orig_status}, Kernel Variant: {kv_status}")

        return result

    except Exception as e:
        # 'index' may not exist yet if the JSON failed to load.
        return {
            "index": index if 'index' in locals() else file_path.stem,
            "status": "error",
            "error": str(e),
            "error_type": type(e).__name__,
            "timestamp": time.time()
        }

async def save_detailed_results(results: List[Dict], output_file: str):
    """Save detailed per-file results as JSON under RESULTS_DIR."""
    output_path = RESULTS_DIR / f"{output_file}_detailed.json"
    try:
        output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding='utf-8')
        print(f"💾 Detailed results saved to {output_path}")
    except Exception as e:
        print(f"❌ Failed to save detailed results: {e}")

def generate_summary_report(results: List[Dict]) -> Dict:
    """Aggregate per-file results into a summary report dict.

    Covers counts by status and problem type, accuracy of originals vs.
    kernel variants, head-to-head comparison buckets, and rigor averages.
    """
    summary = {
        "total_files": len(results),
        "completed": 0,
        "failed": 0,
        "skipped": 0,
        "by_problem_type": {
            "proof": {"count": 0, "original_correct": 0, "kv_correct": 0},
            "calculation": {"count": 0, "original_correct": 0, "kv_correct": 0}
        },
        "original_stats": {"correct": 0, "incorrect": 0, "total_graded": 0},
        "kernel_variant_stats": {"correct": 0, "incorrect": 0, "total_graded": 0},
        "comparison_stats": {
            "both_correct": 0,
            "both_incorrect": 0,
            "original_harder": 0,
            "kernel_variant_harder": 0,
            "total_compared": 0
        },
        "rigor_analysis": {
            "original_avg_rigor": 0,
            "kernel_variant_avg_rigor": 0,
            "rigor_difference_avg": 0
        }
    }

    orig_rigor_scores = []
    kv_rigor_scores = []
    rigor_differences = []

    for result in results:
        if result["status"] == "completed":
            summary["completed"] += 1

            # Tally by problem type
            ptype = result.get("problem_type", "proof")
            if ptype in summary["by_problem_type"]:
                summary["by_problem_type"][ptype]["count"] += 1
                if result["original"].get("grade") == "CORRECT":
                    summary["by_problem_type"][ptype]["original_correct"] += 1
                if result["kernel_variant"].get("grade") == "CORRECT":
                    summary["by_problem_type"][ptype]["kv_correct"] += 1

            # Original-problem stats
            if result["original"].get("grade_status") == "success":
                summary["original_stats"]["total_graded"] += 1
                if result["original"]["grade"] == "CORRECT":
                    summary["original_stats"]["correct"] += 1
                else:
                    summary["original_stats"]["incorrect"] += 1
                orig_rigor_scores.append(result["original"]["reasoning_rigor_score"])

            # Kernel-variant stats
            if result["kernel_variant"].get("grade_status") == "success":
                summary["kernel_variant_stats"]["total_graded"] += 1
                if result["kernel_variant"]["grade"] == "CORRECT":
                    summary["kernel_variant_stats"]["correct"] += 1
                else:
                    summary["kernel_variant_stats"]["incorrect"] += 1
                kv_rigor_scores.append(result["kernel_variant"]["reasoning_rigor_score"])

            # Head-to-head comparison stats
            if result.get("comparison"):
                summary["comparison_stats"]["total_compared"] += 1
                comp = result["comparison"]
                if comp["both_correct"]:
                    summary["comparison_stats"]["both_correct"] += 1
                elif comp["both_incorrect"]:
                    summary["comparison_stats"]["both_incorrect"] += 1
                elif comp["original_harder"]:
                    summary["comparison_stats"]["original_harder"] += 1
                elif comp["kernel_variant_harder"]:
                    summary["comparison_stats"]["kernel_variant_harder"] += 1

                rigor_differences.append(comp["rigor_difference"])

        elif result["status"] == "skipped":
            summary["skipped"] += 1
        else:
            summary["failed"] += 1

    # Averages
    if orig_rigor_scores:
        summary["rigor_analysis"]["original_avg_rigor"] = sum(orig_rigor_scores) / len(orig_rigor_scores)
    if kv_rigor_scores:
        summary["rigor_analysis"]["kernel_variant_avg_rigor"] = sum(kv_rigor_scores) / len(kv_rigor_scores)
    if rigor_differences:
        summary["rigor_analysis"]["rigor_difference_avg"] = sum(rigor_differences) / len(rigor_differences)

    # Accuracy rates
    if summary["original_stats"]["total_graded"] > 0:
        summary["original_stats"]["accuracy"] = summary["original_stats"]["correct"] / summary["original_stats"]["total_graded"]

    if summary["kernel_variant_stats"]["total_graded"] > 0:
        summary["kernel_variant_stats"]["accuracy"] = summary["kernel_variant_stats"]["correct"] / summary["kernel_variant_stats"]["total_graded"]

    return summary

def print_summary_report(summary: Dict):
    """Print the summary report (function continues beyond this chunk)."""
    print("\n" + "="*80)
    print("📊 ORIGINAL vs KERNEL VARIANT COMPARISON REPORT")
    print("="*80)

    print(f"📁 Total files: {summary['total_files']}")
    print(f"✅ Completed: {summary['completed']}")
    print(f"⏭️ Skipped: {summary['skipped']}")
    print(f"❌ Failed: {summary['failed']}")

    print(f"\n📈 ACCURACY COMPARISON:")
    orig_acc = summary["original_stats"].get("accuracy", 0) * 100
    kv_acc = summary["kernel_variant_stats"].get("accuracy", 0) * 100
    print(f"Original Problems: {orig_acc:.1f}% ({summary['original_stats']['correct']}/{summary['original_stats']['total_graded']})")
    print(f"Kernel Variants: {kv_acc:.1f}% ({summary['kernel_variant_stats']['correct']}/{summary['kernel_variant_stats']['total_graded']})")

    if orig_acc > 0 and kv_acc > 0:
        diff = orig_acc - kv_acc
        if diff > 5:
            print(f"📉 Kernel variants are {diff:.1f}% harder (as expected)")
        elif diff < -5:
            print(f"📈 Original problems are {-diff:.1f}% harder (unexpected)")
        else:
            print(f"📊 Similar difficulty (difference: {diff:.1f}%)")

    print(f"\n🎯 BY PROBLEM TYPE:")
    for ptype, stats in summary["by_problem_type"].items():
        if stats["count"] > 0:
            orig_acc_type = (stats["original_correct"] / stats["count"]) * 100
            kv_acc_type = (stats["kv_correct"] / stats["count"]) * 100
            grading_note = " (STRICT grading)" if ptype == "proof" else " (LENIENT grading)"
            print(f"{ptype.upper()} Problems{grading_note}:")
            print(f"  Original: {orig_acc_type:.1f}% ({stats['original_correct']}/{stats['count']})")
            print(f"  Kernel Variant: {kv_acc_type:.1f}% ({stats['kv_correct']}/{stats['count']})")
            if stats["count"] >= 3:  # Only show difference if we have enough samples
                type_diff = orig_acc_type - kv_acc_type
                print(f"  Difference: {type_diff:+.1f}%")

    print(f"\n🔍 DETAILED COMPARISON:")
    comp = summary["comparison_stats"]
    total = comp["total_compared"]
    if total > 0:
        print(f"Both correct: {comp['both_correct']:3d} ({comp['both_correct']/total*100:.1f}%)")
        print(f"Both incorrect: {comp['both_incorrect']:3d} ({comp['both_incorrect']/total*100:.1f}%)")
        print(f"Original harder: {comp['original_harder']:3d} ({comp['original_harder']/total*100:.1f}%)")
        print(f"Kernel variant harder: {comp['kernel_variant_harder']:3d} ({comp['kernel_variant_harder']/total*100:.1f}%)")

    print(f"\n📏 REASONING RIGOR ANALYSIS:")
    rigor = summary["rigor_analysis"]
    print(f"Original avg rigor: {rigor['original_avg_rigor']:.2f}/10")
    print(f"Kernel variant rigor: {rigor['kernel_variant_avg_rigor']:.2f}/10")
    print(f"Difference: {rigor['rigor_difference_avg']:.2f} (positive = original more rigorous)")
print("="*80) + +@click.command() +@click.option("-c", "--concurrency", default=16, show_default=True, + help="Maximum concurrent processing tasks") +@click.option("--max-files", default=50, show_default=True, + help="Maximum number of files to test (for quick testing)") +@click.option("--file-pattern", default="*.json", show_default=True, + help="File pattern to process") +@click.option("--output-prefix", default="comparison_test", show_default=True, + help="Prefix for output files") +@click.option("--debug", is_flag=True, help="Enable debug output") +def main(concurrency: int, max_files: int, file_pattern: str, output_prefix: str, debug: bool): + """原题 vs Kernel Variant 数学能力对比测试""" + print(f"🧪 Starting Original vs Kernel Variant Comparison Test") + print(f" Solver Model: {SOLVER_MODEL}") + print(f" Grader Model: {GRADER_MODEL}") + print(f" Max files: {max_files}") + print(f" Concurrency: {concurrency}") + + if not os.getenv("OPENAI_API_KEY"): + print("❌ OPENAI_API_KEY environment variable not set!") + return + + # 找到测试文件 + all_files = sorted(SRC_DIR.glob(file_pattern)) + if max_files > 0: + all_files = all_files[:max_files] + + print(f"📁 Testing {len(all_files)} files") + + if not all_files: + print("❌ No files found to test!") + return + + async def run_test(): + cli = AsyncOpenAI() + sem = asyncio.Semaphore(concurrency) + + async def worker(file_path: pathlib.Path): + async with sem: + return await test_single_file(file_path, cli) + + # 执行测试 + results = [] + progress_bar = tqdm.tqdm(total=len(all_files), desc="Testing", unit="file") + + tasks = [worker(f) for f in all_files] + for coro in asyncio.as_completed(tasks): + result = await coro + results.append(result) + progress_bar.update(1) + + progress_bar.close() + return results + + # 运行测试 + results = asyncio.run(run_test()) + + # 保存详细结果 + timestamp = int(time.time()) + output_name = f"{output_prefix}_{timestamp}" + asyncio.run(save_detailed_results(results, output_name)) + + # 生成并显示汇总报告 + summary = 
generate_summary_report(results) + print_summary_report(summary) + + # 保存汇总报告 + summary_path = RESULTS_DIR / f"{output_name}_summary.json" + try: + summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding='utf-8') + print(f"💾 Summary report saved to {summary_path}") + except Exception as e: + print(f"❌ Failed to save summary: {e}") + +if __name__ == "__main__": + main() +
#!/usr/bin/env python3
"""
Health check script for all AI providers.

This script tests connectivity, API keys, and basic functionality for all
supported AI providers. Useful for troubleshooting and verifying setup.

Usage:
    python health_check.py                     # Check all providers
    python health_check.py --provider openai   # Check specific provider
    python health_check.py --detailed          # Detailed diagnostics
"""

import asyncio
import json
import sys
import os
from pathlib import Path
import argparse
from typing import Dict, List, Any
from datetime import datetime
import platform

# Add the loader module to the path
sys.path.append(str(Path(__file__).parent))

from loader import create_loader, get_supported_providers, get_default_models


class HealthChecker:
    """Health checker for AI providers."""

    def __init__(self, detailed: bool = False):
        # detailed=True adds system information to the report output.
        self.detailed = detailed
        self.results = {}

    async def check_system_info(self) -> Dict[str, Any]:
        """Return basic host diagnostics (Python, platform, CPU, RAM, disk)."""
        import psutil  # lazy import: the other checks work without psutil

        return {
            'python_version': platform.python_version(),
            'platform': platform.platform(),
            'cpu_count': psutil.cpu_count(),
            'memory_total_gb': round(psutil.virtual_memory().total / (1024**3), 2),
            'memory_available_gb': round(psutil.virtual_memory().available / (1024**3), 2),
            'disk_free_gb': round(psutil.disk_usage('.').free / (1024**3), 2),
            'timestamp': datetime.now().isoformat()
        }

    async def check_environment_variables(self) -> Dict[str, Any]:
        """Report whether each provider API key is set, with a masked preview."""
        env_vars = {
            'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'),
            'ANTHROPIC_API_KEY': os.getenv('ANTHROPIC_API_KEY'),
            'GOOGLE_API_KEY': os.getenv('GOOGLE_API_KEY'),
        }

        return {
            var: {
                'set': bool(value),
                'length': len(value) if value else 0,
                # Only the first 8 characters are ever exposed.
                'preview': value[:8] + '...' if value and len(value) > 8 else value
            }
            for var, value in env_vars.items()
        }

    async def check_dependencies(self) -> Dict[str, Any]:
        """Check required Python packages and report installed versions."""
        dependencies = {
            'openai': 'OpenAI API client',
            'anthropic': 'Anthropic API client',
            'google-generativeai': 'Google Gemini API client',
            'transformers': 'HuggingFace transformers',
            'torch': 'PyTorch for local models',
            'vllm': 'VLLM for local serving',
            'psutil': 'System monitoring'
        }

        results = {}
        for package, description in dependencies.items():
            try:
                # The PyPI name 'google-generativeai' is not importable as-is.
                if package == 'google-generativeai':
                    import google.generativeai
                    version = getattr(google.generativeai, '__version__', 'unknown')
                else:
                    module = __import__(package)
                    version = getattr(module, '__version__', 'unknown')

                results[package] = {
                    'installed': True,
                    'version': version,
                    'description': description
                }
            except ImportError:
                results[package] = {
                    'installed': False,
                    'version': None,
                    'description': description
                }

        return results

    async def check_provider(self, provider: str) -> Dict[str, Any]:
        """Check a specific AI provider: creation, health check, cost estimate."""
        print(f"🔍 Checking {provider}...")

        result = {
            'provider': provider,
            'available': False,
            'health_check_passed': False,
            'error': None,
            'response_time': None,
            'models': {},
            'cost_estimation': None
        }

        try:
            # Get default models
            default_models = get_default_models(provider)
            result['models']['defaults'] = default_models

            # Provider-specific configuration
            loader_kwargs = {}
            if provider == 'vllm':
                loader_kwargs['base_url'] = 'http://localhost:8000/v1'
            elif provider == 'huggingface':
                loader_kwargs['device'] = 'cpu'  # Use CPU for testing
                # Use smaller models for testing
                loader_kwargs['solver_model'] = 'microsoft/DialoGPT-small'
                loader_kwargs['grader_model'] = 'microsoft/DialoGPT-small'

            # Create loader and time its construction
            start_time = asyncio.get_event_loop().time()
            loader = create_loader(provider, **loader_kwargs)
            creation_time = asyncio.get_event_loop().time() - start_time

            result['available'] = True
            result['creation_time'] = creation_time

            # Get model info
            model_info = loader.get_model_info()
            result['models']['configured'] = model_info

            # Health check (bounded so a hung provider cannot stall the report)
            health_start = asyncio.get_event_loop().time()
            health_passed = await asyncio.wait_for(loader.health_check(), timeout=60)
            health_time = asyncio.get_event_loop().time() - health_start

            result['health_check_passed'] = health_passed
            result['response_time'] = health_time

            if health_passed:
                # Cost estimation (best-effort)
                try:
                    cost_info = await loader.estimate_cost(10)
                    result['cost_estimation'] = cost_info
                except Exception as e:
                    result['cost_estimation_error'] = str(e)

                # Try to list models if the loader supports it
                if hasattr(loader, 'list_models'):
                    try:
                        available_models = await loader.list_models()
                        result['models']['available'] = available_models[:10]  # Limit output
                    except Exception as e:
                        result['models']['list_error'] = str(e)

        except asyncio.TimeoutError:
            result['error'] = 'Health check timed out'
        except Exception as e:
            result['error'] = str(e)

        return result

    async def check_all_providers(self, specific_provider: str = None) -> Dict[str, Any]:
        """Check all providers (or one), printing a human-readable report."""
        providers = [specific_provider] if specific_provider else get_supported_providers()

        print("🏥 AI Provider Health Check")
        print("=" * 50)

        # System information (only in detailed mode)
        if self.detailed:
            print("📊 System Information:")
            system_info = await self.check_system_info()
            for key, value in system_info.items():
                print(f"   {key}: {value}")
            print()

        # Environment variables
        print("🔧 Environment Variables:")
        env_info = await self.check_environment_variables()
        for var, info in env_info.items():
            status = "✅" if info['set'] else "❌"
            print(f"   {status} {var}: {'Set' if info['set'] else 'Not set'}")
        print()

        # Dependencies
        print("📦 Dependencies:")
        dep_info = await self.check_dependencies()
        for package, info in dep_info.items():
            status = "✅" if info['installed'] else "❌"
            version = f" (v{info['version']})" if info['installed'] and info['version'] != 'unknown' else ""
            print(f"   {status} {package}{version}")
        print()

        # Provider checks
        print("🤖 Provider Health Checks:")
        provider_results = {}

        for provider in providers:
            provider_result = await self.check_provider(provider)
            provider_results[provider] = provider_result

            # Print summary
            if provider_result['available']:
                if provider_result['health_check_passed']:
                    status = "✅"
                    details = f"({provider_result['response_time']:.2f}s)"
                else:
                    status = "⚠️"
                    details = "(Health check failed)"
            else:
                status = "❌"
                details = f"({provider_result['error']})"

            print(f"   {status} {provider.upper()}: {details}")

        print()

        # Summary
        total_providers = len(providers)
        healthy_providers = sum(1 for r in provider_results.values()
                                if r['available'] and r['health_check_passed'])

        print("📋 Summary:")
        print(f"   Total providers checked: {total_providers}")
        print(f"   Healthy providers: {healthy_providers}")
        print(f"   Success rate: {healthy_providers/total_providers*100:.1f}%")

        # Detailed results
        final_results = {
            'timestamp': datetime.now().isoformat(),
            'summary': {
                'total_providers': total_providers,
                'healthy_providers': healthy_providers,
                'success_rate': healthy_providers/total_providers*100
            },
            'environment': env_info,
            'dependencies': dep_info,
            'providers': provider_results
        }

        # system_info only exists when detailed — guarded by the same flag.
        if self.detailed:
            final_results['system'] = system_info

        return final_results

    async def run_diagnostics(self, provider: str) -> Dict[str, Any]:
        """Run detailed diagnostics for a specific provider."""
        print(f"🔧 Running detailed diagnostics for {provider}...")

        result = await self.check_provider(provider)

        # Additional detailed checks
        if result['available'] and result['health_check_passed']:
            print(f"✅ {provider} is healthy!")

            # Test with a simple problem
            print("🧪 Testing with a simple math problem...")
            try:
                loader_kwargs = {}
                if provider == 'vllm':
                    loader_kwargs['base_url'] = 'http://localhost:8000/v1'
                elif provider == 'huggingface':
                    loader_kwargs['device'] = 'cpu'
                    loader_kwargs['solver_model'] = 'microsoft/DialoGPT-small'
                    loader_kwargs['grader_model'] = 'microsoft/DialoGPT-small'

                loader = create_loader(provider, **loader_kwargs)

                # Simple test problem
                test_problem = {
                    'original': {
                        'problem_statement': 'What is 2 + 2?',
                        'solution': 'The answer is 4.',
                        'problem_type': 'calculation'
                    }
                }

                start_time = asyncio.get_event_loop().time()
                test_result = await asyncio.wait_for(
                    loader.test_single_problem(test_problem, variant_type='original'),
                    timeout=120
                )
                test_time = asyncio.get_event_loop().time() - start_time

                result['test_problem'] = {
                    'success': True,
                    'time': test_time,
                    'grade': 10 if test_result.get('correct', False) else 0,
                    'solution_length': len(test_result.get('solve', {}).get('solution', ''))
                }
                print(f"   ✅ Test completed in {test_time:.2f}s")
                print(f"   📊 Grade: {10 if test_result.get('correct', False) else 0} ({'CORRECT' if test_result.get('correct', False) else 'INCORRECT'})")

            except asyncio.TimeoutError:
                result['test_problem'] = {'success': False, 'error': 'Test timed out'}
                print("   ⚠️ Test problem timed out")
            except Exception as e:
                result['test_problem'] = {'success': False, 'error': str(e)}
                print(f"   ❌ Test problem failed: {str(e)}")

        return result


async def main():
    """Main function."""
    parser = argparse.ArgumentParser(description="Health check for AI providers")
    parser.add_argument("--provider", choices=get_supported_providers(),
                        help="Check specific provider only")
    parser.add_argument("--detailed", action="store_true",
                        help="Show detailed system information")
    parser.add_argument("--diagnostics", action="store_true",
                        help="Run detailed diagnostics (requires --provider)")
    parser.add_argument("--output", type=Path,
                        help="Save results to JSON file")
    parser.add_argument("--quiet", action="store_true",
                        help="Suppress output, save to file only")

    args = parser.parse_args()

    if args.diagnostics and not args.provider:
        print("❌ Error: --diagnostics requires --provider")
        return 1

    # FIX: remember the real stdout and always restore it in `finally`.
    # Previously --quiet redirected sys.stdout to a StringIO and only
    # restored it on the success path, so any exception (including Ctrl-C)
    # left stdout hijacked and the error message was silently swallowed.
    saved_stdout = sys.stdout
    if args.quiet:
        import io
        sys.stdout = io.StringIO()

    checker = HealthChecker(detailed=args.detailed)

    try:
        if args.diagnostics:
            results = await checker.run_diagnostics(args.provider)
        else:
            results = await checker.check_all_providers(args.provider)

        # Save to file if requested
        if args.output:
            args.output.parent.mkdir(parents=True, exist_ok=True)
            with open(args.output, 'w', encoding='utf-8') as f:
                json.dump(results, f, indent=2, ensure_ascii=False)

            if not args.quiet:
                print(f"\n💾 Results saved to {args.output}")

        # In quiet mode, emit only the JSON payload on the real stdout.
        if args.quiet:
            sys.stdout = saved_stdout
            print(json.dumps(results, indent=2))

        return 0

    except KeyboardInterrupt:
        print("\n⏸️ Health check interrupted by user")
        return 1
    except Exception as e:
        print(f"\n❌ Health check failed: {str(e)}")
        return 1
    finally:
        sys.stdout = saved_stdout


if __name__ == "__main__":
    exit(asyncio.run(main()))
#!/usr/bin/env python3
"""
Re-grade an existing results JSON file using a (possibly different) grader model.

The script loads a results file produced by `batch_evaluate.py` (or a compatible
JSON list) and re-grades every problem using the specified grader. No solving
is performed – instead we reuse the previously generated solutions stored in
`solve.solution`.

Example usage
-------------
python regrade.py \
    --results-file results/o3/o3_original.json \
    --dataset-dir dataset/ \
    --provider openai \
    --grader-model o3 \
    --max-concurrent 5 \
    --output results/regraded_o3_original.json

"""

import argparse
import asyncio
import json
import sys
import time
from pathlib import Path
from typing import Any, Dict, List
from datetime import datetime
import logging

# Determine directories
SCRIPT_DIR = Path(__file__).resolve().parent
PROJECT_ROOT = SCRIPT_DIR.parent  # one level up

# Add both the script dir and project root to PYTHONPATH to locate 'loader'
sys.path.append(str(SCRIPT_DIR))
sys.path.append(str(PROJECT_ROOT))

from loader import create_loader  # type: ignore

try:
    from tqdm import tqdm  # type: ignore
    HAS_TQDM = True
except ImportError:  # pragma: no cover
    HAS_TQDM = False

    class tqdm:  # type: ignore
        """Minimal fallback if tqdm is not available."""

        def __init__(self, total=None, desc=None, **kwargs):
            self.total = total
            self.n = 0
            self.desc = desc or ""
            print(f"{self.desc}: starting …")

        def update(self, n=1):
            self.n += n
            if self.total:
                pct = self.n / self.total * 100
                print(f"{self.desc}: {self.n}/{self.total} ({pct:.1f}%)", end="\r")

        def set_postfix(self, _):
            pass

        def close(self):
            print()  # newline


###############################################################################
# Helper functions
###############################################################################


def load_dataset(dataset_dir: Path) -> Dict[str, Dict[str, Any]]:
    """Read every JSON file in *dataset_dir* and return a mapping index → data."""
    dataset: Dict[str, Dict[str, Any]] = {}
    for json_file in dataset_dir.glob("*.json"):
        try:
            with open(json_file, "r", encoding="utf-8") as fh:
                data = json.load(fh)
            idx = data.get("index")
            if idx:
                dataset[idx] = data
        except Exception as exc:  # pragma: no cover – best-effort ingest
            logging.warning("Failed to load %s: %s", json_file, exc)
    return dataset


async def regrade_problem(loader,  # type: ignore[valid-type]
                          problem_record: Dict[str, Any],
                          dataset_entry: Dict[str, Any],
                          variant_type: str) -> Dict[str, Any]:
    """Re-grade one problem and return a new result dict.

    Reuses the previously generated solution from `problem_record["solve"]`
    and grades it against the dataset's reference solution for *variant_type*.
    """

    idx = problem_record.get("index", "unknown")
    problem_type = dataset_entry.get("problem_type", "proof")

    # Extract question & reference solution according to variant
    if variant_type == "original":
        question = str(dataset_entry.get("question", "")).strip()
        reference_solution = str(dataset_entry.get("solution", "")).strip()
    else:
        variant = dataset_entry.get("variants", {}).get(variant_type, {})
        question = str(variant.get("question", "")).strip()
        reference_solution = str(variant.get("solution", "")).strip()

    if not question or not reference_solution:
        return {
            "index": idx,
            "status": "skipped",
            "reason": "missing_fields",
        }

    # Previously generated solution
    student_solution = str(problem_record.get("solve", {}).get("solution", "")).strip()
    final_answer = str(problem_record.get("solve", {}).get("final_answer", "")).strip()

    # Grade the solution (temperature hard-coded inside create_loader for o-series)
    grade_result, _raw = await loader.grade_solution(
        question,
        student_solution,
        reference_solution,
        problem_type,
    )

    # Build merged record retaining original fields + new grade
    new_record = {
        "index": idx,
        "variant_type": variant_type,
        "problem_type": problem_type,
        "solve": {
            "solution": student_solution,
            "final_answer": final_answer,
        },
        "grade": grade_result or {"status": "failed"},
    }

    # Convenience shortcut for correctness
    new_record["correct"] = new_record["grade"].get("grade") == "CORRECT"
    return new_record


###############################################################################
# Main orchestration
###############################################################################


async def main() -> None:  # noqa: C901 – single entry-point
    parser = argparse.ArgumentParser(description="Re-grade an existing results file")
    parser.add_argument("--results-file", required=True, type=Path, help="Path to existing results JSON")
    parser.add_argument("--dataset-dir", required=True, type=Path, help="Directory containing dataset JSON files")
    parser.add_argument("--provider", default="openai", help="Grader provider (default: openai)")
    parser.add_argument("--grader-model", default="o3", help="Grader model name (default: o3)")
    parser.add_argument("--max-concurrent", type=int, default=3, help="Max concurrent API calls")
    parser.add_argument("--variant-type", default="original", help="Problem variant used in results file")
    parser.add_argument("--output", type=Path, help="Where to write re-graded results (JSON)")
    parser.add_argument("--quick", action="store_true", help="Quick mode – single retry, shorter timeouts")
    parser.add_argument("--debug", action="store_true", help="Verbose JSON-parsing debug")

    args = parser.parse_args()

    # Configure logging early
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    if not args.results_file.exists():
        logging.error("Results file %s does not exist", args.results_file)
        sys.exit(1)

    if not args.dataset_dir.exists():
        logging.error("Dataset directory %s does not exist", args.dataset_dir)
        sys.exit(1)

    # Load dataset into memory once
    logging.info("Loading dataset from %s", args.dataset_dir)
    dataset_map = load_dataset(args.dataset_dir)
    logging.info("Loaded %d dataset entries", len(dataset_map))

    # Load results JSON (support two formats: {'problems':[...]} or simple list)
    with open(args.results_file, "r", encoding="utf-8") as fh:
        raw_data = json.load(fh)

    if isinstance(raw_data, dict) and "problems" in raw_data:
        original_problems: List[Dict[str, Any]] = raw_data["problems"]  # type: ignore[assignment]
    elif isinstance(raw_data, list):
        original_problems = raw_data  # type: ignore[assignment]
    else:
        logging.error("Unsupported results file structure – expected list or dict with key 'problems'.")
        sys.exit(1)

    if not original_problems:
        logging.warning("No problems found in results file – nothing to re-grade.")
        sys.exit(0)

    # Create loader – we only need grader, but solver_model must be provided; reuse grader_model
    loader = create_loader(
        args.provider,
        solver_model=args.grader_model,
        grader_model=args.grader_model,
        quick=args.quick,
        debug=args.debug,
    )

    if not await loader.health_check():
        logging.error("Health check failed for provider %s", args.provider)
        sys.exit(1)

    # Estimate costs (rough – assumes avg lengths; tweak as needed)
    cost_info = await loader.estimate_cost(len(original_problems))
    logging.info("Estimated grading cost: $%.2f", cost_info.get("total_cost", 0))

    # Concurrency control
    semaphore = asyncio.Semaphore(args.max_concurrent)

    async def wrapper(problem_record):
        idx = problem_record.get("index", "unknown")
        if idx not in dataset_map:
            logging.warning("Dataset entry for index %s not found – skipping", idx)
            return {"index": idx, "status": "skipped", "reason": "dataset_missing"}
        async with semaphore:
            return await regrade_problem(
                loader,
                problem_record,
                dataset_map[idx],
                args.variant_type,
            )

    # Progress bar setup
    pbar = tqdm(total=len(original_problems), desc="Re-grading")

    async def tracked(problem_record):
        res = await wrapper(problem_record)
        pbar.update(1)
        return res

    # FIX: use asyncio.gather (order-preserving) instead of as_completed, so
    # the re-graded output keeps the same problem order as the input file.
    # Concurrency is still bounded by the semaphore inside `wrapper`.
    results: List[Dict[str, Any]] = list(
        await asyncio.gather(*(tracked(rec) for rec in original_problems))
    )
    pbar.close()

    # Build summary
    completed = [r for r in results if r.get("grade", {}).get("status") == "success"]
    grades = [r["grade"].get("grade") for r in completed]
    numeric = [5.0 if g == "CORRECT" else 2.5 for g in grades]

    summary = {
        "total_problems": len(results),
        "completed": len(completed),
        "correct": sum(1 for g in grades if g == "CORRECT"),
        "incorrect": sum(1 for g in grades if g == "INCORRECT"),
        "average_grade": sum(numeric) / len(numeric) if numeric else 0.0,
        "provider": args.provider,
        "grader_model": args.grader_model,
        "variant_type": args.variant_type,
        "estimated_cost": cost_info,
        "timestamp": datetime.now().isoformat(),
    }

    output_payload = {
        "summary": summary,
        "problems": results,
    }

    # Determine output path
    if args.output:
        out_path = args.output
    else:
        stem = args.results_file.stem + f"_regraded_{args.grader_model}"
        out_path = args.results_file.with_name(stem + args.results_file.suffix)

    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(output_payload, fh, indent=2, ensure_ascii=False)
    logging.info("Saved re-graded results to %s", out_path)

    # Clean up HTTP client if applicable
    if hasattr(loader, "__aexit__"):
        await loader.__aexit__(None, None, None)


if __name__ == "__main__":
    asyncio.run(main())
#!/usr/bin/env python3
"""
Configuration setup script for Putnam Problem Solver.

This script helps users set up API keys, configure providers,
and verify their environment for mathematical problem solving.

Usage:
    python setup_config.py                     # Interactive setup
    python setup_config.py --check             # Check current configuration
    python setup_config.py --provider openai   # Setup specific provider
"""

import asyncio
import json
import sys
import os
from pathlib import Path
import argparse
from typing import Dict, Any, Optional
import getpass
import shlex
import shutil
import subprocess

# Add the loader module to the path
sys.path.append(str(Path(__file__).parent))

from loader import get_supported_providers, get_default_models


class ConfigManager:
    """Configuration manager for Putnam Problem Solver."""

    def __init__(self):
        # Per-user config and env files live in the home directory.
        self.config_file = Path.home() / ".putnam_config.json"
        self.env_file = Path.home() / ".putnam_env"

    def print_banner(self):
        """Print setup banner."""
        print("🛠️ Putnam Problem Solver Configuration Setup")
        print("=" * 55)
        print("This script will help you configure API keys and settings.")
        print()

    def load_config(self) -> Dict[str, Any]:
        """Load existing configuration (empty dict on any failure)."""
        if self.config_file.exists():
            try:
                with open(self.config_file, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except Exception:
                pass
        return {}

    def save_config(self, config: Dict[str, Any]):
        """Save configuration to file."""
        with open(self.config_file, 'w', encoding='utf-8') as f:
            json.dump(config, f, indent=2)
        print(f"✅ Configuration saved to {self.config_file}")

    def update_env_file(self, env_vars: Dict[str, str]):
        """Update the sourceable environment file with *env_vars*.

        Existing entries for the same variables are replaced; other entries
        are preserved.
        """
        lines = []

        # Read existing non-comment lines
        if self.env_file.exists():
            with open(self.env_file, 'r') as f:
                lines = [line.strip() for line in f if line.strip() and not line.startswith('#')]

        # Remove existing vars that we're updating
        lines = [line for line in lines if not any(line.startswith(f"{var}=") for var in env_vars)]

        # Add new vars.
        # FIX: quote the value — API keys or URLs containing shell
        # metacharacters would otherwise break the generated shell file.
        for var, value in env_vars.items():
            if value:
                lines.append(f"export {var}={shlex.quote(value)}")

        # Write back
        with open(self.env_file, 'w') as f:
            f.write("#!/bin/bash\n")
            f.write("# Putnam Problem Solver Environment Variables\n")
            f.write("# Source this file: source ~/.putnam_env\n\n")
            for line in lines:
                f.write(line + "\n")

        print(f"✅ Environment file updated: {self.env_file}")
        print(f"💡 Add to your shell profile: source {self.env_file}")

    def check_dependencies(self) -> Dict[str, bool]:
        """Check required dependencies."""
        dependencies = {
            'python': True,  # We're running Python
            'pip': self._command_exists('pip'),
            'openai': self._package_installed('openai'),
            'anthropic': self._package_installed('anthropic'),
            'google-generativeai': self._package_installed('google-generativeai'),
            'transformers': self._package_installed('transformers'),
            'torch': self._package_installed('torch'),
            'vllm': self._package_installed('vllm'),
            'psutil': self._package_installed('psutil')
        }
        return dependencies

    def _command_exists(self, command: str) -> bool:
        """Check if a command exists on PATH.

        FIX: uses shutil.which instead of spawning `<command> --version`;
        some tools exit nonzero on --version, which wrongly reported them
        as missing, and which() avoids the subprocess entirely.
        """
        return shutil.which(command) is not None

    def _package_installed(self, package: str) -> bool:
        """Check if a Python package is installed."""
        try:
            # The PyPI name 'google-generativeai' is not importable as-is.
            if package == 'google-generativeai':
                import google.generativeai
            else:
                __import__(package)
            return True
        except ImportError:
            return False

    def install_dependencies(self, packages: list):
        """Install missing dependencies via pip."""
        if not packages:
            print("✅ All dependencies are installed!")
            return

        print(f"📦 Installing missing packages: {', '.join(packages)}")

        # Map dependency names to pip package names
        package_map = {
            'openai': 'openai',
            'anthropic': 'anthropic',
            'google-generativeai': 'google-generativeai',
            'transformers': 'transformers',
            'torch': 'torch',
            'vllm': 'vllm',
            'psutil': 'psutil'
        }

        to_install = [package_map[pkg] for pkg in packages if pkg in package_map]

        if to_install:
            try:
                subprocess.run([sys.executable, '-m', 'pip', 'install'] + to_install,
                               check=True)
                print("✅ Dependencies installed successfully!")
            except subprocess.CalledProcessError as e:
                print(f"❌ Failed to install dependencies: {e}")

    def setup_provider(self, provider: str, config: Dict[str, Any]) -> Dict[str, Any]:
        """Interactively configure one provider; returns the updated config."""
        print(f"\n🔧 Setting up {provider.upper()}")
        print("-" * 30)

        provider_config = config.get('providers', {}).get(provider, {})

        if provider == 'openai':
            api_key = self._get_api_key(
                "OpenAI API Key",
                provider_config.get('api_key'),
                "Get your key from: https://platform.openai.com/api-keys"
            )
            if api_key:
                provider_config['api_key'] = api_key
                os.environ['OPENAI_API_KEY'] = api_key

        elif provider == 'anthropic':
            api_key = self._get_api_key(
                "Anthropic API Key",
                provider_config.get('api_key'),
                "Get your key from: https://console.anthropic.com/"
            )
            if api_key:
                provider_config['api_key'] = api_key
                os.environ['ANTHROPIC_API_KEY'] = api_key

        elif provider == 'gemini':
            api_key = self._get_api_key(
                "Google API Key",
                provider_config.get('api_key'),
                "Get your key from: https://makersuite.google.com/app/apikey"
            )
            if api_key:
                provider_config['api_key'] = api_key
                os.environ['GOOGLE_API_KEY'] = api_key

        elif provider == 'kimi':
            api_key = self._get_api_key(
                "Kimi/Moonshot API Key",
                provider_config.get('api_key'),
                "Get your key from: https://platform.moonshot.ai/"
            )
            if api_key:
                provider_config['api_key'] = api_key
                os.environ['MOONSHOT_API_KEY'] = api_key

        elif provider == 'vllm':
            current_url = provider_config.get('base_url', 'http://localhost:8000/v1')
            print(f"Current VLLM server URL: {current_url}")
            new_url = input("Enter VLLM server URL (press Enter to keep current): ").strip()
            if new_url:
                provider_config['base_url'] = new_url
            else:
                provider_config['base_url'] = current_url

            print("💡 Make sure your VLLM server is running:")
            print("   vllm serve meta-llama/Llama-3.2-8B-Instruct --port 8000")

        elif provider == 'huggingface':
            print("HuggingFace runs locally - no API key needed.")
            device = provider_config.get('device', 'auto')
            print(f"Current device setting: {device}")
            new_device = input("Device (auto/cuda/cpu) [press Enter to keep current]: ").strip()
            if new_device in ['auto', 'cuda', 'cpu']:
                provider_config['device'] = new_device

            print("💡 HuggingFace will download models on first use.")

        # Update config
        if 'providers' not in config:
            config['providers'] = {}
        config['providers'][provider] = provider_config

        return config

    def _get_api_key(self, prompt: str, current_key: Optional[str], help_text: str) -> Optional[str]:
        """Prompt for an API key, offering to keep the current (masked) one."""
        if current_key:
            masked_key = current_key[:8] + "..." if len(current_key) > 8 else "***"
            print(f"Current key: {masked_key}")

            if input("Update API key? (y/n): ").lower().startswith('y'):
                print(help_text)
                return getpass.getpass(f"Enter {prompt}: ").strip()
            else:
                return current_key
        else:
            print(f"No {prompt} found.")
            print(help_text)
            if input("Enter API key now? (y/n): ").lower().startswith('y'):
                return getpass.getpass(f"Enter {prompt}: ").strip()

        return None

    async def test_provider(self, provider: str) -> bool:
        """Test a provider configuration via a loader health check."""
        print(f"🧪 Testing {provider}...")

        try:
            from loader import create_loader

            loader_kwargs = {}
            if provider == 'vllm':
                config = self.load_config()
                vllm_config = config.get('providers', {}).get('vllm', {})
                loader_kwargs['base_url'] = vllm_config.get('base_url', 'http://localhost:8000/v1')
            elif provider == 'huggingface':
                config = self.load_config()
                hf_config = config.get('providers', {}).get('huggingface', {})
                loader_kwargs['device'] = hf_config.get('device', 'cpu')
                loader_kwargs['solver_model'] = 'microsoft/DialoGPT-small'
                loader_kwargs['grader_model'] = 'microsoft/DialoGPT-small'

            loader = create_loader(provider, **loader_kwargs)

            # Simple health check
            is_healthy = await loader.health_check()

            if is_healthy:
                print(f"✅ {provider} is working correctly!")
                return True
            else:
                print(f"❌ {provider} health check failed")
                return False

        except Exception as e:
            print(f"❌ {provider} test failed: {str(e)}")
            return False
if len(value) > 8 else "***" + print(f" ✅ {var}: {masked}") + else: + print(f" ❌ {var}: Not set") + + # Dependencies + print("\n📦 Dependencies:") + deps = self.check_dependencies() + for dep, installed in deps.items(): + status = "✅" if installed else "❌" + print(f" {status} {dep}") + + # Config file + print(f"\n📄 Config file: {self.config_file}") + if self.config_file.exists(): + print(" ✅ Exists") + config = self.load_config() + providers = config.get('providers', {}) + if providers: + print(" Configured providers:") + for provider in providers: + print(f" • {provider}") + else: + print(" ❌ Not found") + + # Environment file + print(f"\n🌍 Environment file: {self.env_file}") + if self.env_file.exists(): + print(" ✅ Exists") + print(f" 💡 Source with: source {self.env_file}") + else: + print(" ❌ Not found") + + async def interactive_setup(self): + """Run interactive setup.""" + self.print_banner() + + # Check dependencies first + print("🔍 Checking dependencies...") + deps = self.check_dependencies() + missing_deps = [dep for dep, installed in deps.items() if not installed and dep != 'pip'] + + if missing_deps: + print(f"\n⚠️ Missing dependencies: {', '.join(missing_deps)}") + if input("Install missing dependencies? (y/n): ").lower().startswith('y'): + self.install_dependencies(missing_deps) + + # Load existing config + config = self.load_config() + + # Provider setup + print(f"\n🤖 Available providers: {', '.join(get_supported_providers())}") + + # Ask which providers to configure + if input("\nConfigure all providers? (y/n): ").lower().startswith('y'): + providers_to_setup = get_supported_providers() + else: + providers_to_setup = [] + for provider in get_supported_providers(): + if input(f"Configure {provider}? 
(y/n): ").lower().startswith('y'): + providers_to_setup.append(provider) + + # Setup each provider + env_vars = {} + for provider in providers_to_setup: + config = self.setup_provider(provider, config) + + # Collect env vars + provider_config = config.get('providers', {}).get(provider, {}) + if provider == 'openai' and 'api_key' in provider_config: + env_vars['OPENAI_API_KEY'] = provider_config['api_key'] + elif provider == 'anthropic' and 'api_key' in provider_config: + env_vars['ANTHROPIC_API_KEY'] = provider_config['api_key'] + elif provider == 'gemini' and 'api_key' in provider_config: + env_vars['GOOGLE_API_KEY'] = provider_config['api_key'] + + # Save configuration + self.save_config(config) + + # Update environment file + if env_vars: + self.update_env_file(env_vars) + + # Test providers + if input("\nTest configured providers? (y/n): ").lower().startswith('y'): + print("\n🧪 Testing providers...") + for provider in providers_to_setup: + await self.test_provider(provider) + + print("\n🎉 Setup completed!") + print("\n💡 Next steps:") + print(" 1. Source environment file: source ~/.putnam_env") + print(" 2. Test a provider: python putnam_cli.py test --provider openai") + print(" 3. 
Solve a problem: python putnam_cli.py solve dataset/1938-A-1.json") + + +async def main(): + """Main function.""" + parser = argparse.ArgumentParser(description="Configure Putnam Problem Solver") + parser.add_argument("--check", action="store_true", help="Check current configuration") + parser.add_argument("--provider", choices=get_supported_providers(), + help="Setup specific provider only") + parser.add_argument("--install-deps", action="store_true", help="Install missing dependencies") + parser.add_argument("--test", choices=get_supported_providers(), + help="Test specific provider") + + args = parser.parse_args() + + manager = ConfigManager() + + try: + if args.check: + manager.check_current_config() + elif args.install_deps: + deps = manager.check_dependencies() + missing = [dep for dep, installed in deps.items() if not installed and dep != 'pip'] + manager.install_dependencies(missing) + elif args.test: + await manager.test_provider(args.test) + elif args.provider: + manager.print_banner() + config = manager.load_config() + config = manager.setup_provider(args.provider, config) + manager.save_config(config) + + # Test the provider + if input(f"Test {args.provider}? (y/n): ").lower().startswith('y'): + await manager.test_provider(args.provider) + else: + # Interactive setup + await manager.interactive_setup() + + return 0 + + except KeyboardInterrupt: + print("\n⏸️ Setup interrupted by user") + return 1 + except Exception as e: + print(f"\n❌ Setup failed: {str(e)}") + return 1 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
\ No newline at end of file |
