author     Yuren Hao <yurenh2@illinois.edu>  2026-04-08 22:06:05 -0500
committer  Yuren Hao <yurenh2@illinois.edu>  2026-04-08 22:06:05 -0500
commit     05704d0eb2fa59fe727652465b07db40bcb06c38 (patch)
tree       8904aca836cf552fd1a5ae8c2174e9f91e70bbbc /putnam-bench-anon/README.md

Initial release: GAP framework

- Full pipeline: variant generation, multi-judge verification, evaluation
- Loaders for OpenAI / Anthropic / Google / xAI / OpenRouter / vLLM
- Framework-level mechanism analyses: paired structural overlap, repairability rescue, self-correction probe, cross-model agreement, topic x problem-type interaction
- Unicode -> bare-LaTeX cleaner + audit + spot-check
- Mirrors https://huggingface.co/datasets/blackhao0426/PutnamGAP

Diffstat (limited to 'putnam-bench-anon/README.md'):
 -rw-r--r--  putnam-bench-anon/README.md  |  630 +
 1 file changed, 630 insertions(+), 0 deletions(-)
diff --git a/putnam-bench-anon/README.md b/putnam-bench-anon/README.md
new file mode 100644
index 0000000..8010895
--- /dev/null
+++ b/putnam-bench-anon/README.md
@@ -0,0 +1,630 @@
+# Putnam Mathematical Problem Solver
+
+A comprehensive system for evaluating AI models on mathematical problem solving using the Putnam Competition dataset. This project provides a unified interface for testing multiple AI providers (cloud and local) on complex mathematical problems with automated grading.
+
+## Features
+
+- **Multi-Provider Support**: 8 AI providers: OpenAI, Anthropic, Google Gemini, xAI, Kimi, VLLM server, VLLM direct, and HuggingFace
+- **Dynamic Model Selection**: Runtime model configuration for optimal cost/performance
+- **Robust Evaluation**: Specialized prompts for mathematical problem solving and grading
+- **Local Model Support**: Run models locally via VLLM server or direct HuggingFace inference
+- **GPU Optimization**: Tested on RTX 3060 with CUDA support
+- **Cost Estimation**: Built-in cost calculation for different providers
+- **Async Processing**: Efficient handling of large datasets
+- **Error Recovery**: Intelligent retry logic and JSON parsing fallbacks
+- **Unified CLI**: Single command-line interface for all operations
+- **Progress Tracking**: Real-time progress bars with success/failure statistics
+- **Multi-Variant Testing**: Test all 6 problem variants with a single command
+
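The "Error Recovery" feature above (retry logic plus JSON-parsing fallbacks) can be sketched as follows. This is an illustrative sketch, not the project's actual helpers: `call_with_retry` and `parse_json_with_fallback` are hypothetical names, showing exponential backoff around an async API call and a regex fallback for model replies that wrap JSON in extra prose.

```python
import asyncio
import json
import re

async def call_with_retry(call, max_retries=3, base_delay=1.0):
    """Retry an async API call with exponential backoff (illustrative sketch)."""
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)

def parse_json_with_fallback(text):
    """Parse a model reply as JSON; fall back to the first {...} block in the text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```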
+## Architecture
+
+The project follows a clean, modular architecture:
+
+```
+putnam-bench-anon/
+├── putnam_cli.py          # 🎯 Main CLI interface (your primary entry point)
+├── loader/                # 🔧 Core evaluation engine
+│   ├── __init__.py
+│   ├── providers/         # AI provider implementations
+│   └── utils/             # Utility functions
+├── scripts/               # 📋 Internal scripts (used by CLI)
+│   ├── batch_evaluate.py  # Batch evaluation logic
+│   ├── health_check.py    # Health checking utilities
+│   ├── benchmark.py       # Performance benchmarking
+│   └── compare_*.py       # Analysis scripts
+├── dataset/               # 📚 Problem dataset
+├── results/               # 📊 Evaluation results
+└── requirements*.txt      # 📦 Dependency management
+```
+
+**Key principle**: Use `putnam_cli.py` for all operations - it provides a clean, consistent interface to all functionality.
+
+## Installation
+
+### Quick Start (Automated)
+
+The easiest way to get started is using our automated installation script:
+
+```bash
+# Clone the repository
+git clone <repository-url>
+cd putnam-bench-anon
+
+# One-command setup (installs everything and configures)
+python install.py
+
+# Or quick install (core packages only)
+python install.py --quick
+```
+
+### Manual Installation
+
+#### Environment Setup
+
+We recommend using a dedicated conda environment to avoid dependency conflicts:
+
+```bash
+# Create dedicated environment
+conda create -n putnam-local python=3.10 -y
+conda activate putnam-local
+
+# Clone the repository
+git clone <repository-url>
+cd putnam-bench-anon
+```
+
+#### Choose Your Installation Type
+
+**Option 1: Full Installation (recommended)**
+```bash
+# Install all dependencies including local model support
+pip install -r requirements.txt
+
+# Install PyTorch with CUDA support (for local models)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+```
+
+**Note**: progress bars require `tqdm`; if it is missing from your environment, install it directly:
+```bash
+pip install tqdm
+```
+
+**Option 2: Minimal Installation (cloud providers only)**
+```bash
+# Install only core dependencies for cloud providers
+pip install -r requirements-minimal.txt
+```
+
+**Option 3: Local Models Only**
+```bash
+# Install dependencies for local model inference
+pip install -r requirements-local.txt
+
+# Install PyTorch with CUDA support
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+```
+
+**Option 4: Custom Installation**
+```bash
+# Install core packages manually
+pip install openai anthropic google-generativeai transformers accelerate six
+
+# Optional: For local models
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+
+# Optional: For VLLM (see performance notes below)
+pip install vllm
+```
+
+### GPU Requirements
+
+- **Recommended**: RTX 3060+ with 6GB+ VRAM
+- **Minimum**: Any CUDA-compatible GPU with 4GB+ VRAM
+- **CPU fallback**: Supported but not recommended for performance
+
+### Requirements Files Explained
+
+The project includes several requirements files for different use cases:
+
+- **`requirements.txt`**: Complete installation with all features (recommended)
+  - Includes cloud providers, local models, data analysis tools
+  - Best for full functionality and development
+
+- **`requirements-minimal.txt`**: Minimal installation for cloud providers only
+  - OpenAI, Anthropic, Google Gemini support
+  - Smallest footprint, fastest installation
+  - Best for cloud-only usage
+
+- **`requirements-local.txt`**: Specialized for local model inference
+  - HuggingFace transformers, VLLM, GPU utilities
+  - Includes model optimization tools
+  - Best for privacy-focused or offline usage
+Choose the requirements file that matches your intended usage pattern.
+
+## Quick Start
+
+### Basic Usage
+
+```python
+import asyncio
+from loader import create_loader
+import json
+
+async def main():
+    # Create a loader for any provider
+    loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="gpt-4o-mini")
+
+    # Load a problem
+    with open("dataset/2000-A-1.json") as f:
+        problem = json.load(f)
+
+    # Test the problem
+    result = await loader.test_single_problem(problem, variant_type="original")
+    print(f"Grade: {result['final_grade']}")
+    print(f"Solution: {result['solution']}")
+
+asyncio.run(main())
+```
+
+### Command Line Usage
+
+```bash
+# Activate environment first
+conda activate putnam-local
+
+# Use the unified CLI for all operations
+python putnam_cli.py info # Show system info
+python putnam_cli.py health --provider openai
+python putnam_cli.py test --provider openai
+python putnam_cli.py solve dataset/2000-A-1.json --provider openai
+python putnam_cli.py batch dataset/ --provider openai --max-files 10
+python putnam_cli.py multi-test --provider openai --max-files 50 --concurrent 100
+```
+
+## Multi-Provider Support
+
+The system supports **8 AI providers** with a unified interface:
+
+### Cloud Providers
+
+#### 1. **OpenAI**
+- **Models**: GPT-4o, GPT-4o-mini, o1, o3
+- **Best for**: High-quality solutions and grading
+- **Setup**: Requires `OPENAI_API_KEY` environment variable
+
+#### 2. **Anthropic**
+- **Models**: Claude-3.5-Sonnet, Claude-3.5-Haiku
+- **Best for**: Detailed reasoning and explanation
+- **Setup**: Requires `ANTHROPIC_API_KEY` environment variable
+
+#### 3. **Google Gemini**
+- **Models**: Gemini-1.5-Pro, Gemini-1.5-Flash
+- **Best for**: Cost-effective high-performance solving
+- **Setup**: Requires `GOOGLE_API_KEY` environment variable
+
+#### 4. **xAI**
+- **Models**: Grok-3, Grok-2
+- **Best for**: Advanced reasoning and mathematical problem solving
+- **Setup**: Requires `XAI_API_KEY` environment variable
+
+#### 5. **Kimi (Moonshot AI)**
+- **Models**: moonshot-v1-8k, moonshot-v1-32k, moonshot-v1-128k
+- **Best for**: Chinese and English mathematical problem solving
+- **Setup**: Requires `MOONSHOT_API_KEY` environment variable
+
+### Local Providers
+
+#### 6. **HuggingFace**
+- **Models**: Any HuggingFace model (tested: GPT-2, DialoGPT)
+- **Performance**: Fast loading (~40s first time, then cached)
+- **Cost**: Free after setup
+- **Privacy**: Complete local inference
+- **Best for**: Development, testing, cost-sensitive applications
+
+#### 7. **VLLM Server**
+- **Models**: Any VLLM-compatible model
+- **Best for**: Multi-user server deployment
+- **Setup**: Requires running separate VLLM server
+- **Performance**: Good for sustained workloads
+
+#### 8. **VLLM Direct** ⚠️
+- **Models**: Any VLLM-compatible model
+- **Performance**: **NOT RECOMMENDED** - 72+ second initialization
+- **Issue**: Extremely slow first load due to graph compilation
+- **Use case**: Only for research/benchmarking VLLM internals
+
+```python
+# NOT RECOMMENDED for interactive use
+loader = create_loader("vllm_direct", solver_model="gpt2") # Takes 72+ seconds!
+```
+
+## Performance Test Results
+
+Based on testing with RTX 3060 Laptop GPU (6GB VRAM):
+
+| Provider | First Load | Subsequent | GPU Memory | Recommendation |
+|----------|------------|-----------|------------|----------------|
+| API Clients | 1-2s | 1-2s | 0GB (cloud) | ⭐ **Best** |
+| HuggingFace | ~40s | <1s | 0.27GB | ⭐ **Excellent** |
+| VLLM Server | ~30s | <1s | Variable | ✅ Good |
+| VLLM Direct | **72+s** | <1s | 0.24GB | ❌ **Avoid** |
+
+## Configuration Examples
+
+### HuggingFace Local Setup
+
+```python
+# GPU inference (recommended)
+loader = create_loader(
+    "huggingface",
+    solver_model="gpt2",
+    grader_model="gpt2",
+    device="cuda"
+)
+
+# CPU inference (slower)
+loader = create_loader(
+    "huggingface",
+    solver_model="gpt2",
+    grader_model="gpt2",
+    device="cpu"
+)
+```
+
+### OpenAI Configuration
+
+```python
+# High-quality setup
+loader = create_loader(
+    "openai",
+    solver_model="gpt-4o-mini",
+    grader_model="o3"
+)
+```
+
+### Kimi Configuration
+
+```python
+# Standard 8k context window
+loader = create_loader(
+    "kimi",
+    solver_model="moonshot-v1-8k",
+    grader_model="moonshot-v1-8k"
+)
+
+# Large context window for complex problems
+loader = create_loader(
+    "kimi",
+    solver_model="moonshot-v1-128k",
+    grader_model="moonshot-v1-128k"
+)
+```
+
+### VLLM Server Setup
+
+```bash
+# Start VLLM server first (separate terminal):
+conda activate putnam-local
+vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
+```
+
+```python
+# Then use the client
+loader = create_loader(
+    "vllm",
+    solver_model="meta-llama/Llama-3.1-8B-Instruct",
+    base_url="http://localhost:8000/v1"
+)
+```
+
+## Recommended Usage Patterns
+
+### For Development/Testing
+```python
+# Fast iteration with local models
+loader = create_loader("huggingface", solver_model="gpt2", device="cuda")
+```
+
+### For Production/Important Results
+```python
+# High-quality cloud models
+loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="o3")
+loader = create_loader("anthropic", solver_model="claude-3-5-sonnet")
+```
+
+### For Privacy-Sensitive Work
+```python
+# Completely local inference
+loader = create_loader("huggingface", solver_model="gpt2", device="cuda")
+```
+
+### For Cost Optimization
+```python
+# Free local models after setup
+loader = create_loader("huggingface", solver_model="gpt2", device="cuda")
+
+# Or cost-effective cloud
+loader = create_loader("openai", solver_model="gpt-4o-mini")
+```
+
+## Dataset Structure
+
+The dataset contains Putnam Competition problems in JSON format:
+
+```json
+{
+  "original": {
+    "problem_statement": "Mathematical problem...",
+    "solution": "Step-by-step solution...",
+    "problem_type": "proof"
+  },
+  "descriptive_long": {...},
+  "kernel_variant": {...}
+}
+```
+
+### Problem Variants
+
+- **original**: The standard problem statement
+- **descriptive_long**: More verbose problem description
+- **descriptive_long_confusing**: Intentionally confusing wording
+- **descriptive_long_misleading**: Misleading formulation
+- **garbled_string**: Corrupted text version
+- **kernel_variant**: Minimal essential formulation
+
+## Advanced Usage
+
+### GPU Memory Management
+
+```python
+# Check GPU status
+import torch
+if torch.cuda.is_available():
+    print(f"GPU: {torch.cuda.get_device_name(0)}")
+    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
+```
+
+### Cost Estimation
+
+```python
+# Estimate costs before processing
+cost_info = await loader.estimate_cost(
+    num_problems=100,
+    avg_problem_length=1000,
+    avg_solution_length=2000
+)
+
+print(f"Estimated cost: ${cost_info['total_cost']:.2f}")
+print(f"Cost per problem: ${cost_info['cost_per_problem']:.4f}")
+```
+
+### Batch Processing
+
+```python
+import asyncio
+import json
+from pathlib import Path
+
+from loader import create_loader
+
+async def process_dataset(dataset_path, provider="openai"):
+    loader = create_loader(provider)
+
+    problems = []
+    for file_path in Path(dataset_path).glob("*.json"):
+        with open(file_path) as f:
+            problems.append(json.load(f))
+
+    # Process problems concurrently
+    tasks = [
+        loader.test_single_problem(problem, variant_type="original")
+        for problem in problems
+    ]
+
+    results = await asyncio.gather(*tasks, return_exceptions=True)
+    return results
+```
+
+#### Incremental Saving and Resume Support
+
+Batch evaluation supports incremental saving and resuming interrupted runs:
+
+```bash
+# Start a new evaluation with incremental saving
+python putnam_cli.py batch dataset/ --provider openai --output results/my_results.json
+
+# If interrupted, resume from checkpoint (no other arguments needed!)
+python putnam_cli.py batch --resume results/checkpoint_my_results_20240101_120000.json
+
+# The checkpoint file is automatically created and updated after each problem
+# It contains all progress and can be used to resume if the process is interrupted
+```
+
+**Features:**
+- **Automatic Checkpointing**: Results are saved after each problem completes
+- **Resume Support**: Continue from where you left off if interrupted
+- **Backward Compatibility**: Existing checkpoint files continue to work
+- **Atomic Saves**: Uses temporary files to ensure data integrity
+- **Progress Tracking**: Shows saved/completed problems in real-time
+- **Automatic Cleanup**: Checkpoint files are removed after successful completion
+
+**Resume Usage:**
+```bash
+# For new checkpoint files (created after this update)
+python putnam_cli.py batch --resume checkpoint_file.json
+
+# For existing checkpoint files (created before this update)
+python putnam_cli.py batch dataset/ --provider openai --resume old_checkpoint_file.json
+```
+
+The system automatically detects checkpoint format and handles both old and new formats seamlessly.
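The "Atomic Saves" behaviour described above can be sketched as: write the checkpoint to a temporary file in the same directory, then rename it over the target, so a crash mid-write never leaves a half-written checkpoint. This is an illustrative sketch of the technique, not the project's exact code:

```python
import json
import os
import tempfile

def save_checkpoint_atomic(results, path):
    """Write checkpoint JSON atomically: temp file in the same dir, then rename (sketch)."""
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(results, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Because `os.replace` is atomic, a reader (or a resumed run) always sees either the previous complete checkpoint or the new complete one, never a mixture.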
+
+### Multi-Variant Testing
+
+Test all 6 problem variants with a single command using the unified CLI:
+
+```bash
+# Test all variants with OpenAI (50 files per variant)
+python putnam_cli.py multi-test --provider openai --max-files 50
+
+# Test specific variants only
+python putnam_cli.py multi-test --provider anthropic --variants original kernel_variant --max-files 25
+
+# Test with custom models and high concurrency
+python putnam_cli.py multi-test --provider openai --solver-model gpt-4.1-nano --grader-model o3 --max-files 1051 --concurrent 1100
+
+# Test with local models
+python putnam_cli.py multi-test --provider huggingface --device cuda --max-files 100
+
+# Test with VLLM server
+python putnam_cli.py multi-test --provider vllm --vllm-url http://localhost:8000/v1 --max-files 50
+```
+
+**Available Problem Variants:**
+- `original` - Standard mathematical problems
+- `descriptive_long` - Clear variable renaming
+- `descriptive_long_confusing` - Random unrelated words (marshmallow, armadillo, etc.)
+- `descriptive_long_misleading` - Misleading variable names (nonpositiveterm, negativeinitial, etc.)
+- `garbled_string` - Completely scrambled variable names
+- `kernel_variant` - Simplified core mathematical version
+
+**Output Structure:**
+```
+multi_variant_results/
+├── openai_gpt-4o-mini_20241201_143022/
+│   ├── original_20241201_143022.json
+│   ├── descriptive_long_20241201_143022.json
+│   ├── ...
+│   └── SUMMARY_openai_gpt-4o-mini_20241201_143022.json
+└── multi_config_comparison_20241201_143022.json
+```
+
+## Troubleshooting
+
+### CUDA Issues
+```bash
+# Check CUDA compatibility
+python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+
+# If CUDA issues, reinstall PyTorch
+pip uninstall torch torchvision torchaudio -y
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+```
+
+### Memory Issues
+```python
+# Clear GPU cache
+import torch
+if torch.cuda.is_available():
+    torch.cuda.empty_cache()
+```
+
+### VLLM Performance Issues
+- **Problem**: VLLM Direct takes 72+ seconds to load
+- **Solution**: Use HuggingFace local or VLLM server instead
+- **Alternative**: Use cloud providers for better speed
+
+## Testing
+
+```bash
+# Activate environment
+conda activate putnam-local
+
+# Quick health checks (a heredoc avoids shell-quoting problems in the f-strings)
+python - <<'EOF'
+import asyncio
+from loader import create_loader
+
+async def test():
+    # Test available providers
+    providers = [
+        ('openai', 'gpt-4o-mini'),
+        ('huggingface', 'gpt2'),
+        ('anthropic', 'claude-3-5-haiku'),
+        ('kimi', 'moonshot-v1-8k'),
+    ]
+    for provider, model in providers:
+        try:
+            loader = create_loader(provider, solver_model=model)
+            result = await loader.health_check()
+            status = '✅ Ready' if result else '❌ Failed'
+            print(f'{provider}: {status}')
+        except Exception as e:
+            print(f'{provider}: ❌ Error - {e}')
+
+asyncio.run(test())
+EOF
+```
+
+## Performance Recommendations
+
+### For Speed ⚡
+1. **API Clients**: Sub-2s response (OpenAI, Anthropic, xAI, Gemini)
+2. **HuggingFace**: Fast after first load, completely local
+3. **VLLM Server**: Good for sustained workloads
+
+### For Cost 💰
+1. **HuggingFace**: Free after setup, local inference
+2. **OpenAI GPT-4o-mini**: Cost-effective cloud option
+3. **Gemini Flash**: Good price/performance ratio
+
+### For Privacy 🔒
+1. **HuggingFace**: Complete local inference
+2. **VLLM Server**: Local deployment with server architecture
+
+### For Quality 🎯
+1. **OpenAI o3**: Excellent grading capability
+2. **Claude-3.5-Sonnet**: Detailed explanations
+3. **Gemini Pro**: Strong mathematical reasoning
+
+## API Reference
+
+### Supported Providers
+- `openai`: OpenAI GPT models
+- `anthropic`: Anthropic Claude models
+- `gemini`: Google Gemini models
+- `xai`: xAI Grok models
+- `kimi`: Kimi/Moonshot models
+- `vllm`: VLLM server models
+- `vllm_direct`: Direct VLLM API (slow)
+- `huggingface`: HuggingFace transformers
+
+### Factory Function
+
+```python
+create_loader(provider: str, **kwargs) -> ModelLoader
+```
+
+### Key Methods
+
+```python
+# Test a single problem
+await loader.test_single_problem(problem_data, variant_type)
+
+# Health check
+await loader.health_check()
+
+# Cost estimation
+await loader.estimate_cost(num_problems)
+
+# Model information
+loader.get_model_info()
+```
+
+## Contributing
+
+1. Fork the repository
+2. Create a feature branch (`git checkout -b feature/amazing-feature`)
+3. Test with the putnam-local environment
+4. Add tests for new functionality
+5. Submit a pull request
+
+## License
+
+[License information]
+
+## Citation
+
+If you use this code in your research, please cite:
+
+```bibtex
+@misc{putnam-bench-anon,
+  title={Putnam Mathematical Problem Solver},
+  year={2024},
+  howpublished={\url{<repository-url>}}
+}
+```
\ No newline at end of file