| field | value | timestamp |
|---|---|---|
| author | Yuren Hao <yurenh2@illinois.edu> | 2026-04-08 22:06:05 -0500 |
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-04-08 22:06:05 -0500 |
| commit | 05704d0eb2fa59fe727652465b07db40bcb06c38 | |
| tree | 8904aca836cf552fd1a5ae8c2174e9f91e70bbbc | |
Initial release: GAP framework
- Full pipeline: variant generation, multi-judge verification, evaluation
- Loaders for OpenAI / Anthropic / Google / xAI / OpenRouter / vLLM
- Framework-level mechanism analyses: paired structural overlap, repairability rescue, self-correction probe, cross-model agreement, topic × problem-type interaction
- Unicode -> bare-LaTeX cleaner + audit + spot-check
- Mirrors https://huggingface.co/datasets/blackhao0426/PutnamGAP
63 files changed, 17117 insertions, 0 deletions
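The "paired structural overlap" analysis named in the commit message compares, for each model, the trajectory produced on a perturbed variant against a reference trajectory, scores the pair with a token-level Jaccard overlap, and summarizes the stable-vs-brittle gap with Cohen's *d* (the real implementation lives in `analysis/structural_overlap.py` and `analysis/aggregate_overlap.py` in the diff below). A minimal sketch of that metric chain — the helper names and the deliberately crude tokenizer here are illustrative stand-ins, not the repository's actual API:

```python
import re
import statistics

def tokens(text: str) -> set[str]:
    # Crude stand-in for the repo's canonicalization step:
    # lowercase alphanumeric tokens only.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_jaccard(a: str, b: str) -> float:
    # Jaccard similarity between the two trajectories' token sets.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def cohens_d(stable: list[float], brittle: list[float]) -> float:
    # Pooled-standard-deviation Cohen's d: how far apart the
    # stable-case and brittle-case overlap distributions sit.
    na, nb = len(stable), len(brittle)
    va, vb = statistics.variance(stable), statistics.variance(brittle)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.fmean(stable) - statistics.fmean(brittle)) / pooled
```

Under this reading, a positive *d* means stable (still-correct) trajectories overlap the reference more than brittle (newly-wrong) ones, which is the direction the headline table in `aggregate_overlap.py` counts as `#dir+`.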
diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..f97bb38 --- /dev/null +++ b/.gitignore @@ -0,0 +1,36 @@ +__pycache__/ +*.pyc +*.pyo +*.pyd +.Python +*.so +*.egg +*.egg-info/ + +# Local results / outputs +results/ +results_new/ +runs/ +outputs/ +*.log + +# Data caches +.cache/ +huggingface_cache/ + +# Editor / OS +.DS_Store +.idea/ +.vscode/ +*.swp +.ipynb_checkpoints/ + +# Secrets +.env +.env.* +api_keys.txt +secrets.json +*.pem +*.key +.putnam_config.json +.putnam_env diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000..b77a346 --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,88 @@ +cff-version: 1.2.0 +message: | + If you use the GAP framework or the PutnamGAP dataset, you MUST cite both + this work and the four MAA Putnam source books listed in the README under + "Citation". +title: "GAP — Generalization-and-Perturbation Framework for LLM Mathematical Reasoning Robustness" +authors: + - family-names: Hao + given-names: Yuren + - family-names: Wan + given-names: Xiang + - family-names: Zhai + given-names: ChengXiang +year: 2025 +url: "https://github.com/YurenHao0426/GAP" +repository-code: "https://github.com/YurenHao0426/GAP" +license: CC-BY-4.0 +preferred-citation: + type: article + title: | + An Investigation of Robustness of LLMs in Mathematical Reasoning: + Benchmarking with Mathematically-Equivalent Transformation of Advanced + Mathematical Problems + authors: + - family-names: Hao + given-names: Yuren + - family-names: Wan + given-names: Xiang + - family-names: Zhai + given-names: ChengXiang + year: 2025 + journal: "arXiv preprint arXiv:2508.08833" + url: "https://arxiv.org/abs/2508.08833" +references: + - type: book + title: "The William Lowell Putnam Mathematical Competition: Problems and Solutions 1938-1964" + authors: + - family-names: Gleason + given-names: A. M. + - family-names: Greenwood + given-names: R. E. + - family-names: Kelly + given-names: L. M. 
+ year: 1980 + publisher: + name: Mathematical Association of America + notes: "MAA Problem Books, vol. 1; reprinted by AMS/MAA Press" + - type: book + title: "The William Lowell Putnam Mathematical Competition: Problems and Solutions 1965-1984" + authors: + - family-names: Alexanderson + given-names: Gerald L. + - family-names: Klosinski + given-names: Leonard F. + - family-names: Larson + given-names: Loren C. + year: 1985 + publisher: + name: Mathematical Association of America + notes: "MAA Problem Books, vol. 30" + - type: book + title: "The William Lowell Putnam Mathematical Competition 1985-2000: Problems, Solutions and Commentary" + authors: + - family-names: Kedlaya + given-names: Kiran S. + - family-names: Poonen + given-names: Bjorn + - family-names: Vakil + given-names: Ravi + year: 2002 + publisher: + name: Mathematical Association of America + notes: "MAA Problem Books, vol. 33" + - type: book + title: "The William Lowell Putnam Mathematical Competition 2001-2016: Problems, Solutions and Commentary" + authors: + - family-names: Kedlaya + given-names: Kiran S. + - family-names: Kane + given-names: Daniel M. + - family-names: Kane + given-names: Jonathan M. + - family-names: O'Dorney + given-names: Evan M. + year: 2020 + publisher: + name: American Mathematical Society (MAA Press) + notes: "MAA Problem Books, vol. 37" @@ -0,0 +1,60 @@ +GAP Framework License +===================== + +PART 1 -- All pipeline source code, variant generation scripts, evaluation +harness, structural-overlap analysis, rescue runner, cleaning/audit tools, +README, datasheet, and any other artefacts authored by the GAP project are +released under the: + + Creative Commons Attribution 4.0 International License (CC BY 4.0) + https://creativecommons.org/licenses/by/4.0/legalcode + +You are free to: + - Share -- copy and redistribute the material in any medium or format + - Adapt -- remix, transform, and build upon the material for any purpose, + even commercially. 
+ +Under the following terms: + - Attribution -- You must give appropriate credit, provide a link to the + license, and indicate if changes were made. You may do so in any + reasonable manner, but not in any way that suggests the licensor + endorses you or your use. + +PART 2 -- ORIGINAL PUTNAM PROBLEMS AND CANONICAL SOLUTIONS + +The companion PutnamGAP dataset (https://huggingface.co/datasets/blackhao0426/PutnamGAP) +contains the original problem statements and canonical solutions of the +William Lowell Putnam Mathematical Competition, reproduced from four +authoritative monographs published by the Mathematical Association of +America (MAA Press) and redistributed under the fair-use clause stated in +the front-matter of every volume: + + "Individual readers ... are permitted to make fair use of the material, + such as to copy select pages for use in teaching or research." + +These texts remain the intellectual property of the Mathematical +Association of America (MAA). They are NOT licensed under CC BY 4.0. + +If you use this code or the PutnamGAP dataset in any research output you +MUST cite the four MAA source books in addition to the GAP framework +paper. The four source books are: + + 1. Gleason, Greenwood, Kelly. The William Lowell Putnam Mathematical + Competition: Problems and Solutions 1938-1964. MAA, 1980. + 2. Alexanderson, Klosinski, Larson. The William Lowell Putnam + Mathematical Competition: Problems and Solutions 1965-1984. MAA, 1985. + 3. Kedlaya, Poonen, Vakil. The William Lowell Putnam Mathematical + Competition 1985-2000: Problems, Solutions and Commentary. MAA, 2002. + 4. Kedlaya, Kane, Kane, O'Dorney. The William Lowell Putnam Mathematical + Competition 2001-2016: Problems, Solutions and Commentary. AMS + (MAA Press), 2020. + +Problem and solution sets from 2017 onward are included with the explicit +permission of MAA. 
+ +NOTICE FOR RIGHTS-HOLDERS + +If you are an author, publisher, or rights-holder and you believe any +portion of this release infringes your rights, please open an issue at +https://github.com/YurenHao0426/GAP/issues or email the maintainer. +The affected items will be removed promptly. diff --git a/README.md b/README.md new file mode 100644 index 0000000..846e176 --- /dev/null +++ b/README.md @@ -0,0 +1,226 @@ +# GAP — Generalization-and-Perturbation Framework + +[arXiv:2508.08833](https://arxiv.org/abs/2508.08833) +[PutnamGAP dataset](https://huggingface.co/datasets/blackhao0426/PutnamGAP) +[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) + +**GAP** (*Generalization-and-Perturbation*) is an automatable evaluation framework for stress-testing the **robustness of LLM mathematical reasoning** under semantically equivalent transformations of advanced math problems. It partitions equivalence-preserving transformations into two qualitatively different families — **surface renaming** and **kernel parameter resampling** — and provides paired-evaluation, mechanism-sensitive analyses that prior perturbation benchmarks do not support. + +This repository contains the **complete pipeline source code**: + +- The variant generation and evaluation pipeline used to build PutnamGAP and to evaluate 18 LLMs +- The label-free structural-overlap analysis used in our framework-level mechanism analyses +- The repairability rescue harness with three-condition prefix injection +- Auxiliary cleaning, audit, aggregation, and figure-rendering scripts + +The companion **PutnamGAP dataset** (1,051 Putnam problems, each paired with 5 mathematically equivalent variants: 1,051 × 6 = 6,306 items including the originals) is hosted on Hugging Face: <https://huggingface.co/datasets/blackhao0426/PutnamGAP>. + +The accompanying paper is on arXiv: [2508.08833](https://arxiv.org/abs/2508.08833). + +--- + +## What this repo provides + +| Directory | Contents | +|---|---| +| `putnam-bench-anon/` | The main GAP CLI and pipeline. 
Loaders for OpenAI / Anthropic / Google / xAI / OpenRouter / vLLM, multi-judge variant verification (`scripts/`), surface and kernel variant generators, end-to-end evaluation runner (`putnam_cli.py`), setup helper, install script, and per-provider prompt templates. | +| `putnamsup/` | Standalone runners using the OpenRouter API and a local-model HuggingFace inference path. Includes `evaluate_putnam_gap.py`, `run_putnam_gap.py`, `run_putnam_gap_openrouter.py`, and the `putnamgap_viewer.py` browser. | +| `analysis/` | Framework-level mechanism analyses: paired structural overlap (`structural_overlap.py`, `kv_overlap.py`, `aggregate_overlap.py`), repairability rescue (`rescue_runner.py`, `rescue_prompts.py`, `rescue_api.py`, `rescue_analyze.py`, `rescue_pooled.py`), self-correction probe (`self_correction.py`, `sc_success_and_difficulty.py`), cross-model agreement (`cross_model_agreement.py`), topic × problem-type interaction (`topic_problemtype_interaction.py`), spontaneous-normalization sub-finding (`normalization_analysis.py`), figure rendering (`make_figures.py`), Unicode → LaTeX cleaner and audit (`unicode_clean.py`, `unicode_audit.py`, `balance_diff.py`, `spotcheck_clean.py`). | +| `mini_gap_math*.py`, `kv_math*.py` | Stand-alone scripts used to instantiate GAP on the MATH benchmark (Mini-GAP-MATH) and to run kernel-variant generation experiments. | + +--- + +## Installation + +```bash +git clone https://github.com/YurenHao0426/GAP.git +cd GAP/putnam-bench-anon +python -m pip install -r requirements.txt # or requirements-minimal.txt for CPU-only +``` + +Set the provider API keys you intend to use as environment variables (the pipeline reads them via `os.getenv` — there are no hard-coded credentials): + +```bash +export OPENAI_API_KEY=sk-... +export ANTHROPIC_API_KEY=sk-ant-... +export GOOGLE_API_KEY=... +export XAI_API_KEY=... # optional +export OPENROUTER_API_KEY=... 
# optional +``` + +Then download the PutnamGAP dataset: + +```bash +python -c " +from huggingface_hub import snapshot_download +snapshot_download('blackhao0426/PutnamGAP', repo_type='dataset', + local_dir='./putnam-bench-anon/dataset', allow_patterns='dataset/*.json') +" +``` + +(If you prefer, you can also use `datasets.load_dataset('blackhao0426/PutnamGAP')` directly inside Python and bypass the on-disk layout.) + +--- + +## Quick start + +### 1. Run the GAP evaluation pipeline on PutnamGAP + +```bash +cd putnam-bench-anon +python putnam_cli.py --solver-model gpt-4o-mini --grader-model gpt-4o \ + --variant original --output ../runs/gpt-4o-mini-original.json +python putnam_cli.py --solver-model gpt-4o-mini --grader-model gpt-4o \ + --variant garbled_string --output ../runs/gpt-4o-mini-gs.json +``` + +The CLI iterates over the 1,051 problems in the configured `dataset/` directory, calls the solver model, scores the response with the grader model, and writes a structured JSON results file containing per-problem solve and grade records (`solve.solution`, `grade.grade`, `correct`, etc.). + +### 2. Generate framework-level mechanism analyses + +After you have run the solver pipeline on at least the original variant and one surface variant, you can run the label-free structural overlap analysis: + +```bash +python analysis/structural_overlap.py +python analysis/aggregate_overlap.py +``` + +This will produce per-cell Cohen's *d* statistics for the stable-vs-brittle structural overlap dichotomy described in the paper. + +### 3. Run the repairability rescue experiment + +```bash +python analysis/rescue_runner.py --pilot # 5 cases per cell smoke test +python analysis/rescue_runner.py --cases 30 # full rescue run +python analysis/rescue_pooled.py # pooled summary tables +``` + +The rescue harness re-solves each flip case under three prefix conditions (`canonical_T2`, `own_T2`, `null`) using the same model under test, then grades the new attempt. + +### 4. 
Plot the headline figures + +```bash +python analysis/make_figures.py +``` + +--- + +## Reproducing the published results + +The exact configuration we used is: + +| Step | Models | +|---|---| +| Solver evaluation (18 models) | OpenAI o3, o4-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini; Anthropic claude-opus-4, claude-sonnet-4; Google gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite; xAI grok-4; Alibaba qwen3-235B; Meta llama-4 maverick; Moonshot kimi-k2; DeepSeek deepseek-prover; Mistral devstral-medium | +| Kernel variant generation (5-judge unanimous) | o3, claude-sonnet-4, gemini-2.5-flash, gpt-4.1-mini, gpt-4o | +| Grader (paper) | o3 (primary) with three-provider cross-check via gpt-4o + claude-sonnet-4 + gemini-2.5-flash on a stratified subset (κ = 0.96–1.00) | +| Rescue experiment (this repo) | 4 representative models: gpt-4.1-mini, gpt-4o-mini, claude-sonnet-4, gemini-2.5-flash, with gpt-4o as the grader | + +The three-provider grader cross-check protocol is documented in `putnam-bench-anon/scripts/regrade.py`. The 5-judge kernel variant verification is documented in `putnam-bench-anon/scripts/compare_original_vs_kernel_test.py`. + +--- + +## Important: Source Attribution + +> **The original Putnam Competition problem statements and the canonical solutions in PutnamGAP are reproduced from four authoritative monographs published by the Mathematical Association of America (MAA Press), under the fair-use clause printed in the front-matter of every volume:** +> +> *"Individual readers ... are permitted to make fair use of the material, such as to copy select pages for use in teaching or research."* +> +> **All original problem statements and canonical solutions remain the intellectual property of the MAA. If you use this code or the PutnamGAP dataset for any research output, you MUST cite both the GAP framework paper AND the four MAA source books listed below. 
Failure to do so misrepresents the provenance of the original problems.** + +Problem and solution sets from 2017 onward are included in PutnamGAP with the explicit permission of MAA. + +**Takedown notice.** If you are an author, publisher, or rights-holder and you believe any portion of this release infringes your rights, please open an issue at <https://github.com/YurenHao0426/GAP/issues> or email the maintainer. The affected items will be removed promptly. + +--- + +## Citation + +If you use this code or the PutnamGAP dataset, you **must** cite **all five** entries below: the GAP framework paper **and** the four MAA Putnam source books that the original problems and solutions are reproduced from. Citing fewer is a misrepresentation of the dataset's provenance. + +In-text example: + +> "We evaluate on PutnamGAP \cite{hao2025gap, putnamI, putnamII, putnamIII, putnamIV}." + +Full BibTeX (copy the entire block — all five entries are mandatory): + +```bibtex +@article{hao2025gap, + title = {An Investigation of Robustness of {LLM}s in Mathematical Reasoning: + Benchmarking with Mathematically-Equivalent Transformation of + Advanced Mathematical Problems}, + author = {Hao, Yuren and Wan, Xiang and Zhai, ChengXiang}, + journal = {arXiv preprint arXiv:2508.08833}, + year = {2025}, + url = {https://arxiv.org/abs/2508.08833} +} + +@book{putnamI, + title = {The William Lowell Putnam Mathematical Competition: + Problems and Solutions 1938--1964}, + author = {Gleason, A. M. and Greenwood, R. E. and Kelly, L. M.}, + publisher = {Mathematical Association of America}, + year = {1980}, + series = {MAA Problem Books}, + volume = {1}, + address = {Washington, DC}, + note = {673\,pp; reprinted by AMS/MAA Press} +} + +@book{putnamII, + title = {The William Lowell Putnam Mathematical Competition: + Problems and Solutions 1965--1984}, + author = {Alexanderson, Gerald L. and Klosinski, Leonard F. 
and + Larson, Loren C.}, + publisher = {Mathematical Association of America}, + year = {1985}, + series = {MAA Problem Books}, + volume = {30}, + address = {Washington, DC}, + note = {Reprinted by AMS/MAA Press} +} + +@book{putnamIII, + title = {The William Lowell Putnam Mathematical Competition 1985--2000: + Problems, Solutions and Commentary}, + author = {Kedlaya, Kiran S. and Poonen, Bjorn and Vakil, Ravi}, + publisher = {Mathematical Association of America}, + year = {2002}, + series = {MAA Problem Books}, + volume = {33}, + address = {Washington, DC}, + note = {Reprinted by AMS/MAA Press} +} + +@book{putnamIV, + title = {The William Lowell Putnam Mathematical Competition 2001--2016: + Problems, Solutions and Commentary}, + author = {Kedlaya, Kiran S. and Kane, Daniel M. and Kane, Jonathan M. and + O'Dorney, Evan M.}, + publisher = {American Mathematical Society (MAA Press)}, + year = {2020}, + series = {MAA Problem Books}, + volume = {37}, + address = {Providence, RI}, + note = {Softcover and e-book versions available} +} +``` + +> **Reminder.** The four `putnamI`–`putnamIV` entries are not optional or supplementary; the original problem statements and canonical solutions in PutnamGAP are reproduced from those four MAA monographs under the MAA fair-use clause, and the IP belongs to the Mathematical Association of America. Any downstream use of this code or dataset that omits the four MAA citations misrepresents the dataset's provenance. + +--- + +## License + +- The **pipeline source code**, **variant generation scripts**, **evaluation harness**, **structural-overlap analysis**, **rescue runner**, **cleaning/audit tools**, and any other artefact authored by the GAP project is released under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). 
+- The **original Putnam Competition problem statements and canonical solutions** that the pipeline operates on remain copyrighted by the Mathematical Association of America (MAA). They are redistributed in the companion PutnamGAP dataset under MAA's stated fair-use clause and only for educational and research use. **Downstream users must cite the four MAA source books listed above.** + +--- + +## Links + +- **Paper (arXiv)**: <https://arxiv.org/abs/2508.08833> +- **Hugging Face dataset**: <https://huggingface.co/datasets/blackhao0426/PutnamGAP> +- **GAP framework code (this repo)**: <https://github.com/YurenHao0426/GAP> +- **PutnamGAP companion code mirror**: <https://github.com/YurenHao0426/PutnamGAP> +- **Issues & contact**: <https://github.com/YurenHao0426/GAP/issues> diff --git a/analysis/aggregate_overlap.py b/analysis/aggregate_overlap.py new file mode 100644 index 0000000..cd6b53e --- /dev/null +++ b/analysis/aggregate_overlap.py @@ -0,0 +1,91 @@ +"""Aggregate structural_overlap results by variant type and by model. + +Produces a clean rebuttal table. 
+""" +from __future__ import annotations +import json +import statistics +from pathlib import Path +from collections import defaultdict + +RESULTS = Path("/home/yurenh2/gap/analysis/structural_overlap_results.json") +SHORT = {"descriptive_long":"DL","descriptive_long_confusing":"DLC", + "descriptive_long_misleading":"DLM","garbled_string":"GS"} + + +def main(): + cells = json.load(open(RESULTS)) + print(f"Loaded {len(cells)} cells.\n") + + # Per-variant aggregate + per_variant = defaultdict(list) + for c in cells: + per_variant[c["variant"]].append(c) + + print("=" * 90) + print("HEADLINE TABLE: Surface variants — stable vs brittle structural overlap") + print("(token Jaccard on canonicalized trajectories, drift cases only)") + print("=" * 90) + print(f"\n{'Variant':<6} {'#cells':>7} {'#dir+':>6} {'#p<.05':>8} " + f"{'med-d':>7} {'mean-d':>7} {'mean-dlt':>9} " + f"{'mean-stbl':>10} {'mean-brit':>10} {'mean-noise':>11} " + f"{'mean-collapse%':>14}") + print("-" * 100) + for v, cs in per_variant.items(): + ds = [c["metrics"]["token_jaccard"]["cohens_d"] for c in cs] + ps = [c["metrics"]["token_jaccard"]["p_two_sided"] for c in cs] + n_pos = sum(1 for d in ds if d > 0) + n_sig = sum(1 for p in ps if p < 0.05) + deltas = [c["metrics"]["token_jaccard"]["delta_median"] for c in cs] + stbl = [c["metrics"]["token_jaccard"]["stable_median"] for c in cs] + brit = [c["metrics"]["token_jaccard"]["brittle_median"] for c in cs] + noise = [c["metrics"]["token_jaccard"]["noise_floor_median"] for c in cs + if c["metrics"]["token_jaccard"].get("noise_floor_median") is not None] + collapse = [c["brittle_collapse_rate"] for c in cs] + print(f"{SHORT[v]:<6} {len(cs):>7} {n_pos:>6} {n_sig:>8} " + f"{statistics.median(ds):>+7.2f} {statistics.fmean(ds):>+7.2f} " + f"{statistics.fmean(deltas):>+9.4f} " + f"{statistics.fmean(stbl):>10.3f} {statistics.fmean(brit):>10.3f} " + f"{statistics.fmean(noise):>11.3f} " + f"{statistics.fmean(collapse)*100:>13.1f}%") + + # Variant-aggregate (across 
all models, n-weighted) + print("\n" + "=" * 90) + print("ALL CELLS (18 models × 4 surface variants)") + print("=" * 90) + all_d = [c["metrics"]["token_jaccard"]["cohens_d"] for c in cells] + all_p = [c["metrics"]["token_jaccard"]["p_two_sided"] for c in cells] + print(f" cells: {len(cells)}") + print(f" direction-positive: {sum(1 for d in all_d if d>0)}/{len(cells)}") + print(f" p<0.05: {sum(1 for p in all_p if p<0.05)}/{len(cells)}") + print(f" p<0.001: {sum(1 for p in all_p if p<0.001)}/{len(cells)}") + print(f" p<1e-6: {sum(1 for p in all_p if p<1e-6)}/{len(cells)}") + print(f" Cohen's d median: {statistics.median(all_d):+.3f}") + print(f" Cohen's d mean: {statistics.fmean(all_d):+.3f}") + print(f" Cohen's d range: [{min(all_d):+.2f}, {max(all_d):+.2f}]") + + # Per-model aggregate (averaged across 4 surface variants) + per_model = defaultdict(list) + for c in cells: + per_model[c["model"]].append(c) + print("\n" + "=" * 90) + print("PER MODEL (averaged across 4 surface variants)") + print("=" * 90) + print(f"\n{'Model':<25} {'mean-d':>7} {'mean-stbl':>10} {'mean-brit':>10} " + f"{'mean-coll%':>11} {'min-p':>9}") + print("-" * 80) + rows = [] + for m, cs in per_model.items(): + if len(cs) == 0: continue + d = statistics.fmean(c["metrics"]["token_jaccard"]["cohens_d"] for c in cs) + s = statistics.fmean(c["metrics"]["token_jaccard"]["stable_median"] for c in cs) + b = statistics.fmean(c["metrics"]["token_jaccard"]["brittle_median"] for c in cs) + col = statistics.fmean(c["brittle_collapse_rate"] for c in cs) * 100 + mp = min(c["metrics"]["token_jaccard"]["p_two_sided"] for c in cs) + rows.append((m, d, s, b, col, mp)) + for r in sorted(rows, key=lambda r: -r[1]): + print(f"{r[0]:<25} {r[1]:>+7.2f} {r[2]:>10.3f} {r[3]:>10.3f} {r[4]:>10.1f}% {r[5]:>9.1e}") + + +if __name__ == "__main__": + main() diff --git a/analysis/balance_diff.py b/analysis/balance_diff.py new file mode 100644 index 0000000..f420d46 --- /dev/null +++ b/analysis/balance_diff.py @@ -0,0 +1,109 @@ 
+"""Compare brace/paren/bracket balance BEFORE vs AFTER cleaning to check +whether the cleaner introduced any new imbalance.""" +from __future__ import annotations +import json +import tarfile +from pathlib import Path +from collections import Counter + +CURRENT_DIR = Path("/home/yurenh2/gap/putnam-bench-anon/dataset") +BACKUP_TAR = sorted(Path("/home/yurenh2/gap/analysis/dataset_backups").glob( + "putnam-bench-anon_dataset_*.tar.gz"))[-1] + + +def all_text(d: dict) -> str: + out = [] + for k in ("question", "solution"): + out.append(d.get(k) or "") + for vk, vd in (d.get("variants") or {}).items(): + if isinstance(vd, dict): + for k in ("question", "solution"): + out.append(vd.get(k) or "") + return "\n".join(out) + + +def balance(text: str): + return ( + text.count("{") - text.count("}"), + text.count("(") - text.count(")"), + text.count("[") - text.count("]"), + ) + + +def main(): + print("Loading backup ...") + backup = {} + with tarfile.open(BACKUP_TAR, "r:gz") as tar: + for member in tar.getmembers(): + if not member.isfile() or not member.name.endswith(".json"): + continue + f = tar.extractfile(member) + if not f: + continue + d = json.load(f) + backup[d.get("index")] = all_text(d) + print(f" loaded {len(backup)} backup problems") + + print("Loading current ...") + current = {} + for f in sorted(CURRENT_DIR.glob("*.json")): + d = json.load(open(f)) + current[d.get("index")] = all_text(d) + print(f" loaded {len(current)} current problems") + + # Per-file balance diff + introduced_imbalance = [] + fixed_imbalance = [] + same_imbalance = 0 + same_balanced = 0 + + n_brace_changed = 0 + n_paren_changed = 0 + n_brack_changed = 0 + + for idx in sorted(backup): + b_before = balance(backup[idx]) + b_after = balance(current.get(idx, "")) + was_bal = b_before == (0, 0, 0) + is_bal = b_after == (0, 0, 0) + if b_before != b_after: + if was_bal and not is_bal: + introduced_imbalance.append((idx, b_before, b_after)) + elif not was_bal and is_bal: + 
fixed_imbalance.append((idx, b_before, b_after)) + else: + if is_bal: + same_balanced += 1 + else: + same_imbalance += 1 + if b_before[0] != b_after[0]: n_brace_changed += 1 + if b_before[1] != b_after[1]: n_paren_changed += 1 + if b_before[2] != b_after[2]: n_brack_changed += 1 + + print(f"\n=== Per-file balance change summary ===") + print(f" Files with no change in any balance:") + print(f" balanced both before and after: {same_balanced}") + print(f" imbalanced before and after (same imbalance): {same_imbalance}") + print(f" Files where cleaner INTRODUCED new imbalance: " + f"{len(introduced_imbalance)}") + print(f" Files where cleaner FIXED prior imbalance: {len(fixed_imbalance)}") + print() + print(f" Files where {{ balance changed: {n_brace_changed}") + print(f" Files where ( balance changed: {n_paren_changed}") + print(f" Files where [ balance changed: {n_brack_changed}") + + if introduced_imbalance: + print(f"\n!!! Cleaner-introduced imbalances ({len(introduced_imbalance)}):") + for idx, before, after in introduced_imbalance[:10]: + print(f" {idx}: before={before}, after={after}") + else: + print("\n ✓ No cleaner-introduced imbalances found.") + + if fixed_imbalance: + print(f"\n Cleaner-fixed imbalances (top 10):") + for idx, before, after in fixed_imbalance[:10]: + print(f" {idx}: before={before}, after={after}") + + +if __name__ == "__main__": + main() diff --git a/analysis/cross_model_agreement.py b/analysis/cross_model_agreement.py new file mode 100644 index 0000000..fb9a571 --- /dev/null +++ b/analysis/cross_model_agreement.py @@ -0,0 +1,180 @@ +"""Cross-model agreement analysis: which problems are universally hard? + +For each (variant, problem) cell, count how many models fail (correct=False). +Identify "universally hard" problems (failed by ≥80% of models on the variant) +and "universally easy" (correct by ≥80% on the variant). Then check whether +the universally hard *flip set* is dominated by certain topics, problem types, +or years. 
+ +Outputs: +- Per-variant histogram of failure counts +- "Universal flip" cases: original correct by ≥80% of models, variant wrong by ≥80% +- These are the cleanest signals of variant-induced fragility because they + rule out problem-specific quirks +""" +from __future__ import annotations +import json +import sys +from pathlib import Path +from collections import defaultdict, Counter + +THIS_DIR = Path(__file__).resolve().parent +sys.path.insert(0, str(THIS_DIR)) +from structural_overlap import find_variant_file, load_problems, RESULTS_DIR, SURFACE_VARIANTS + +DATASET_DIR = Path("/home/yurenh2/gap/putnam-bench-anon/dataset") + + +def load_metadata(): + """Load problem-level metadata: type, tag, difficulty, year.""" + out = {} + for f in sorted(DATASET_DIR.glob("*.json")): + d = json.load(open(f)) + idx = d.get("index") + out[idx] = { + "type": d.get("type"), + "tag": d.get("tag"), + "difficulty": d.get("difficulty"), + "problem_type": d.get("problem_type"), + "year": int(idx.split("-")[0]) if idx and "-" in idx else None, + } + return out + + +def main(): + base = RESULTS_DIR + models = sorted([d.name for d in base.iterdir() if d.is_dir()]) + print(f"Loading {len(models)} models ...") + metadata = load_metadata() + + # correct_table[(variant, idx)][model] = bool + correct_table = defaultdict(dict) + for m in models: + mdir = base / m + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + vp = find_variant_file(mdir, v) + if not vp: + continue + for p in load_problems(vp): + idx = p.get("index") + correct = p.get("correct") + if idx is None or correct is None: + continue + correct_table[(v, idx)][m] = correct + + print(f"Loaded {len(correct_table)} (variant, problem) cells.\n") + + # Per-variant histogram of correct counts (out of N models) + print("=== HISTOGRAM OF CORRECT-COUNT ACROSS MODELS ===") + print("(How many models get each problem right per variant)\n") + print(f"{'Variant':<24} {'mean correct/N':>16} {'median':>9} {'#unanimous-fail':>17} 
{'#unanimous-pass':>17}") + print("-" * 90) + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + cells = [d for (vv, idx), d in correct_table.items() if vv == v] + if not cells: + continue + counts = [sum(1 for vv in cell.values() if vv) / len(cell) for cell in cells] + unanimous_fail = sum(1 for cell in cells if not any(cell.values()) and len(cell) >= 3) + unanimous_pass = sum(1 for cell in cells if all(cell.values()) and len(cell) >= 3) + import statistics + print(f"{v:<24} {statistics.fmean(counts)*100:>14.1f}% {statistics.median(counts)*100:>7.1f}% " + f"{unanimous_fail:>17} {unanimous_pass:>17}") + + # Universal flip cases: original correct by ≥80% of models, variant wrong by ≥80% + print(f"\n\n=== UNIVERSAL FLIP CASES (orig ≥80% correct, variant ≥80% wrong) ===\n") + print("These are the cleanest signals of variant-induced fragility.\n") + print(f"{'Variant':<24} {'# universal flips':>20}") + print("-" * 50) + universal_flips = defaultdict(list) + for v in SURFACE_VARIANTS + ["kernel_variant"]: + for idx in {ii for (vv, ii) in correct_table if vv == "original"}: + orig_cell = correct_table.get(("original", idx), {}) + var_cell = correct_table.get((v, idx), {}) + common = set(orig_cell) & set(var_cell) + if len(common) < 5: continue + orig_rate = sum(1 for m in common if orig_cell[m]) / len(common) + var_rate = sum(1 for m in common if var_cell[m]) / len(common) + if orig_rate >= 0.80 and var_rate <= 0.20: + universal_flips[v].append((idx, orig_rate, var_rate)) + print(f"{v:<24} {len(universal_flips[v]):>20}") + + # Topic / problem_type / difficulty / year breakdown for universal flips + print(f"\n\n=== TOPIC BREAKDOWN OF UNIVERSAL FLIPS ===\n") + for v in SURFACE_VARIANTS + ["kernel_variant"]: + if not universal_flips[v]: continue + print(f"--- {v} ({len(universal_flips[v])} universal flips) ---") + topics = Counter() + ptypes = Counter() + difficulties = Counter() + years = Counter() + for idx, _, _ in universal_flips[v]: + md = 
metadata.get(idx, {}) + tag = md.get("tag") + # tag may be a list (multi-tag) or a string + if isinstance(tag, list): + for t in tag: topics[t] += 1 + elif tag: + topics[tag] += 1 + else: + topics["?"] += 1 + ptypes[md.get("problem_type") or "?"] += 1 + diff = md.get("difficulty") + if isinstance(diff, list): diff = diff[0] if diff else "?" + difficulties[diff or "?"] += 1 + year = md.get("year") + if year: + # Bin years by decade + decade = (year // 10) * 10 + years[f"{decade}s"] += 1 + print(f" topics: {dict(topics.most_common(8))}") + print(f" problem_type:{dict(ptypes)}") + print(f" difficulties:{dict(difficulties.most_common(6))}") + print(f" decades: {dict(sorted(years.items()))}") + print() + + # Save universal flips for later analysis + out = {v: [{"index": idx, "orig_rate": o, "var_rate": vr} + for (idx, o, vr) in flips] + for v, flips in universal_flips.items()} + json.dump(out, open(THIS_DIR / "universal_flips.json", "w"), indent=2) + print(f"\nSaved -> analysis/universal_flips.json") + + # Topic-stratified analysis: failure rate per topic per variant + print(f"\n\n=== ACCURACY BY TOPIC × VARIANT (mean across models) ===\n") + by_topic_variant = defaultdict(lambda: defaultdict(list)) + for (v, idx), cell in correct_table.items(): + md = metadata.get(idx, {}) + tag = md.get("tag") + if not tag or len(cell) < 5: continue + # If multiple tags, attribute the same rate to each — keeps it simple + topics_for_problem = tag if isinstance(tag, list) else [tag] + rate = sum(1 for vv in cell.values() if vv) / len(cell) + for t in topics_for_problem: + by_topic_variant[t][v].append(rate) + + topics_to_show = ["ALG", "ANA", "NT", "COMB", "GEO"] + print(f"{'Topic':<8}", end="") + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + short = {"original":"orig","descriptive_long":"DL","descriptive_long_confusing":"DLC", + "descriptive_long_misleading":"DLM","garbled_string":"GS","kernel_variant":"KV"}[v] + print(f" {short:>5}", end="") + print(" Δ_orig→KV") + 
print("-" * 70) + for t in topics_to_show: + if t not in by_topic_variant: continue + row = by_topic_variant[t] + if "original" not in row: continue + orig_mean = statistics.fmean(row["original"]) * 100 + print(f"{t:<8}", end="") + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + if v in row: + m = statistics.fmean(row[v]) * 100 + print(f" {m:>4.1f}%", end="") + else: + print(f" {'-':>5}", end="") + kv_mean = statistics.fmean(row.get("kernel_variant", [0])) * 100 + print(f" {kv_mean - orig_mean:+5.1f}pp") + + +if __name__ == "__main__": + main() diff --git a/analysis/kv_overlap.py b/analysis/kv_overlap.py new file mode 100644 index 0000000..137e61f --- /dev/null +++ b/analysis/kv_overlap.py @@ -0,0 +1,332 @@ +"""Kernel-variant structural-overlap analysis (label-free). + +Unlike surface variants, kernel variants change the math, so we cannot use the +model's own original-correct trajectory as a reference. Instead we use the +dataset's canonical kernel-variant solution as the reference. + +Hypothesis: stable (correct on KV) trajectories have higher structural overlap +with the canonical KV solution than brittle (wrong on KV) trajectories. + +For comparability we also recompute the surface analyses using the same +'overlap with canonical solution' metric, so we can compare apples-to-apples +the magnitude of stable-vs-brittle gap between surface and kernel. 
+""" +from __future__ import annotations +import json +import os +import statistics +from pathlib import Path +from collections import defaultdict +from typing import Optional + +# Reuse helpers from the sibling module +import sys +sys.path.insert(0, str(Path(__file__).parent)) +from structural_overlap import ( + DATASET_DIR, RESULTS_DIR, + load_problems, find_variant_file, + canonicalize_text, normalize_whitespace, + tokens, bigrams, jaccard, extract_math_blocks, + metric_token_jaccard, metric_bigram_jaccard, + metric_directional_coverage, metric_equation_jaccard, + mann_whitney_u, bootstrap_ci_cohens_d, + is_collapse, COLLAPSE_MIN_CHARS, COLLAPSE_RATIO, + SURFACE_VARIANTS, +) + + +def load_dataset_variant_solutions() -> dict: + """Returns: {problem_index: {variant_name: canonical_solution_text}}. + + Includes 'original' (from top-level field) plus all 5 variants. + """ + out = {} + for f in sorted(DATASET_DIR.glob("*.json")): + d = json.load(open(f)) + idx = d.get("index") + cell = {"original": d.get("solution") or "", + "_problem_type": d.get("problem_type")} + for v, vd in d.get("variants", {}).items(): + if isinstance(vd, dict): + cell[v] = vd.get("solution") or "" + out[idx] = cell + return out + + +def load_dataset_maps() -> dict: + """Mirrors structural_overlap.load_dataset_maps but localized for safety.""" + out = {} + for f in sorted(DATASET_DIR.glob("*.json")): + d = json.load(open(f)) + idx = d.get("index") + variants = d.get("variants", {}) + cell = {} + for v in SURFACE_VARIANTS: + vd = variants.get(v, {}) + mp_str = vd.get("map") + if isinstance(mp_str, str): + try: + mp = eval(mp_str, {"__builtins__": {}}, {}) + if isinstance(mp, dict): + cell[v] = {str(k): str(v) for k, v in mp.items()} + except Exception: + pass + elif isinstance(mp_str, dict): + cell[v] = {str(k): str(v) for k, v in mp_str.items()} + out[idx] = cell + return out + + +# ---------- Cell analyzer ---------- + +def analyze_kv_cell(model_name: str, model_dir: Path, + 
canonical_solutions: dict) -> Optional[dict]: + """Compare model's KV trajectory to dataset canonical KV solution. + + No canonicalization (no rename map for KV — variables match by construction). + """ + orig_path = find_variant_file(model_dir, "original") + var_path = find_variant_file(model_dir, "kernel_variant") + if not orig_path or not var_path: + return None + orig_by = {p["index"]: p for p in load_problems(orig_path)} + var_by = {p["index"]: p for p in load_problems(var_path)} + + pairs_stable_drift = [] + pairs_brittle_drift = [] + n_brittle_collapse = 0 + n_stable_collapse = 0 + + for idx in set(orig_by) & set(var_by): + po, pv = orig_by[idx], var_by[idx] + if po.get("correct") is not True: + continue # Restrict to "model already gets the original" + var_correct = pv.get("correct") + if var_correct is None: + continue + var_text = (pv.get("solve") or {}).get("solution") or "" + if not var_text: + continue + canon_kv = canonical_solutions.get(idx, {}).get("kernel_variant", "") + if not canon_kv or len(canon_kv) < 200: + continue + # Collapse rule: variant text < 200 chars OR < 25% of canonical solution + collapse = (len(var_text) < COLLAPSE_MIN_CHARS or + len(var_text) < COLLAPSE_RATIO * len(canon_kv)) + sample = {"index": idx, "var_text": var_text, "canon": canon_kv} + if var_correct is True: + if collapse: + n_stable_collapse += 1 + else: + pairs_stable_drift.append(sample) + else: + if collapse: + n_brittle_collapse += 1 + else: + pairs_brittle_drift.append(sample) + + if not pairs_stable_drift or not pairs_brittle_drift: + return None + + metrics = { + "token_jaccard": metric_token_jaccard, + "bigram_jaccard": metric_bigram_jaccard, + "equation_jaccard": metric_equation_jaccard, + "directional_coverage": metric_directional_coverage, + } + + out = { + "model": model_name, + "variant": "kernel_variant", + "n_stable_drift": len(pairs_stable_drift), + "n_brittle_drift": len(pairs_brittle_drift), + "n_brittle_collapse": n_brittle_collapse, + 
"n_stable_collapse": n_stable_collapse, + "brittle_collapse_rate": n_brittle_collapse / + max(1, n_brittle_collapse + len(pairs_brittle_drift)), + "metrics": {}, + } + for mname, mfn in metrics.items(): + s_vals = [mfn(p["var_text"], p["canon"]) for p in pairs_stable_drift] + b_vals = [mfn(p["var_text"], p["canon"]) for p in pairs_brittle_drift] + U, p = mann_whitney_u(s_vals, b_vals) + sm, bm = statistics.fmean(s_vals), statistics.fmean(b_vals) + ssd = statistics.pstdev(s_vals) if len(s_vals) > 1 else 0 + bsd = statistics.pstdev(b_vals) if len(b_vals) > 1 else 0 + pooled = (((len(s_vals)-1)*ssd**2 + (len(b_vals)-1)*bsd**2) + / max(1, len(s_vals)+len(b_vals)-2)) ** 0.5 + d = (sm - bm) / pooled if pooled > 0 else 0.0 + out["metrics"][mname] = { + "stable_median": statistics.median(s_vals), + "stable_mean": sm, + "brittle_median": statistics.median(b_vals), + "brittle_mean": bm, + "delta_median": statistics.median(s_vals) - statistics.median(b_vals), + "cohens_d": d, + "U": U, + "p_two_sided": p, + } + # Headline bootstrap + s_vals = [metric_token_jaccard(p["var_text"], p["canon"]) for p in pairs_stable_drift] + b_vals = [metric_token_jaccard(p["var_text"], p["canon"]) for p in pairs_brittle_drift] + d_lo, d_hi = bootstrap_ci_cohens_d(s_vals, b_vals, n_iter=400) + out["metrics"]["token_jaccard"]["cohens_d_ci"] = [d_lo, d_hi] + return out + + +# ---------- Surface re-analysis with canonical reference ---------- + +def analyze_surface_cell_against_canonical(model_name: str, variant: str, + model_dir: Path, + canonical_solutions: dict) -> Optional[dict]: + """Compare model variant trajectory to dataset canonical variant solution. + + For comparability with KV. No rename canonicalization needed since both + sides use the same variant naming. 
+ """ + var_path = find_variant_file(model_dir, variant) + orig_path = find_variant_file(model_dir, "original") + if not var_path or not orig_path: + return None + var_by = {p["index"]: p for p in load_problems(var_path)} + orig_by = {p["index"]: p for p in load_problems(orig_path)} + + pairs_stable, pairs_brittle = [], [] + n_brittle_collapse = 0 + for idx in set(var_by): + if idx not in orig_by: + continue + if orig_by[idx].get("correct") is not True: + continue # restrict to model-knows-original + pv = var_by[idx] + var_correct = pv.get("correct") + if var_correct is None: + continue + var_text = (pv.get("solve") or {}).get("solution") or "" + if not var_text: + continue + canon_var = canonical_solutions.get(idx, {}).get(variant, "") + if not canon_var or len(canon_var) < 200: + continue + if (len(var_text) < COLLAPSE_MIN_CHARS or + len(var_text) < COLLAPSE_RATIO * len(canon_var)): + if var_correct is False: + n_brittle_collapse += 1 + continue + sample = {"index": idx, "var_text": var_text, "canon": canon_var} + if var_correct is True: + pairs_stable.append(sample) + else: + pairs_brittle.append(sample) + + if not pairs_stable or not pairs_brittle: + return None + + metrics = { + "token_jaccard": metric_token_jaccard, + "bigram_jaccard": metric_bigram_jaccard, + "equation_jaccard": metric_equation_jaccard, + "directional_coverage": metric_directional_coverage, + } + out = { + "model": model_name, + "variant": variant, + "n_stable_drift": len(pairs_stable), + "n_brittle_drift": len(pairs_brittle), + "n_brittle_collapse": n_brittle_collapse, + "brittle_collapse_rate": n_brittle_collapse / + max(1, n_brittle_collapse + len(pairs_brittle)), + "metrics": {}, + } + for mname, mfn in metrics.items(): + s_vals = [mfn(p["var_text"], p["canon"]) for p in pairs_stable] + b_vals = [mfn(p["var_text"], p["canon"]) for p in pairs_brittle] + U, p = mann_whitney_u(s_vals, b_vals) + sm, bm = statistics.fmean(s_vals), statistics.fmean(b_vals) + ssd = statistics.pstdev(s_vals) if 
len(s_vals) > 1 else 0 + bsd = statistics.pstdev(b_vals) if len(b_vals) > 1 else 0 + pooled = (((len(s_vals)-1)*ssd**2 + (len(b_vals)-1)*bsd**2) + / max(1, len(s_vals)+len(b_vals)-2)) ** 0.5 + d = (sm - bm) / pooled if pooled > 0 else 0.0 + out["metrics"][mname] = { + "stable_median": statistics.median(s_vals), + "stable_mean": sm, + "brittle_median": statistics.median(b_vals), + "brittle_mean": bm, + "delta_median": statistics.median(s_vals) - statistics.median(b_vals), + "cohens_d": d, + "U": U, + "p_two_sided": p, + } + return out + + +def main(): + print("Loading canonical solutions ...") + canon = load_dataset_variant_solutions() + print(f" loaded {len(canon)} problems") + + all_models = sorted([d.name for d in RESULTS_DIR.iterdir() if d.is_dir()]) + + kv_results = [] + surface_results = [] + + print(f"\n{'KERNEL VARIANT — variant trajectory vs canonical KV solution':<70}") + print(f"{'Cell':<32} {'nSd':>4} {'nBd':>4} {'col%':>5} " + f"{'sMed':>6} {'bMed':>6} {'d':>6} {'p':>9}") + print("-" * 90) + for m in all_models: + mdir = RESULTS_DIR / m + if not mdir.exists(): + continue + res = analyze_kv_cell(m, mdir, canon) + if res is None: + continue + kv_results.append(res) + md = res["metrics"]["token_jaccard"] + print(f"{m+' / KV':<32} {res['n_stable_drift']:>4} {res['n_brittle_drift']:>4} " + f"{res['brittle_collapse_rate']*100:>4.0f}% " + f"{md['stable_median']:>6.3f} {md['brittle_median']:>6.3f} " + f"{md['cohens_d']:>+6.2f} {md['p_two_sided']:>9.1e}") + + print(f"\n{'SURFACE VARIANT — variant trajectory vs canonical variant solution':<70}") + print(f"{'Cell':<46} {'nSd':>4} {'nBd':>4} {'col%':>5} " + f"{'sMed':>6} {'bMed':>6} {'d':>6} {'p':>9}") + print("-" * 95) + for m in all_models: + mdir = RESULTS_DIR / m + if not mdir.exists(): + continue + for v in SURFACE_VARIANTS: + res = analyze_surface_cell_against_canonical(m, v, mdir, canon) + if res is None: + continue + surface_results.append(res) + md = res["metrics"]["token_jaccard"] + print(f"{m+' / 
'+v:<46} {res['n_stable_drift']:>4} {res['n_brittle_drift']:>4} "
+                  f"{res['brittle_collapse_rate']*100:>4.0f}% "
+                  f"{md['stable_median']:>6.3f} {md['brittle_median']:>6.3f} "
+                  f"{md['cohens_d']:>+6.2f} {md['p_two_sided']:>9.1e}")
+
+    # Save next to this script (avoids a hardcoded user path)
+    out_dir = Path(__file__).parent
+    json.dump(kv_results, open(out_dir / "kv_overlap_results.json", "w"), indent=2)
+    json.dump(surface_results, open(out_dir / "surface_canonical_results.json", "w"), indent=2)
+
+    # Aggregate compare
+    print("\n" + "=" * 80)
+    print("AGGREGATE: surface (vs canonical) vs kernel (vs canonical)")
+    print("=" * 80)
+    for tag, results in [("surface", surface_results), ("kernel", kv_results)]:
+        ds = [c["metrics"]["token_jaccard"]["cohens_d"] for c in results]
+        ps = [c["metrics"]["token_jaccard"]["p_two_sided"] for c in results]
+        col = [c["brittle_collapse_rate"] for c in results]
+        if not ds:
+            continue
+        print(f"{tag:<8} cells={len(ds):>3} d_pos={sum(1 for d in ds if d>0):>3}/{len(ds):<3} "
+              f"p<.05={sum(1 for p in ps if p<0.05):>3}/{len(ps):<3} "
+              f"d_med={statistics.median(ds):+.2f} d_mean={statistics.fmean(ds):+.2f} "
+              f"collapse_mean={statistics.fmean(col)*100:.1f}%")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/analysis/make_figures.py b/analysis/make_figures.py
new file mode 100644
index 0000000..4ff598d
--- /dev/null
+++ b/analysis/make_figures.py
@@ -0,0 +1,272 @@
+"""Three rebuttal figures.
+
+Fig1 — Structural Cohen's d heatmap
+    18 models × 5 variants (4 surface + KV).
+    Surface cells use the self-anchor metric (model's own original under
+    inverse rename). KV uses the canonical-anchor metric.
+
+Fig2 — Rescue rebound rates by variant + condition
+    Pooled across 4 models. Bar plot with Wilson 95 % CI.
+    Three bars per variant: null / canonical_T2 / own_T2 (KV: only 2).
+
+Fig3 — own_T2 vs canonical_T2 per (model, variant)
+    Scatter plot of own_T2 rebound rate vs canonical_T2 rebound rate per
+    cell, with the y=x line.
Points above the diagonal: own outperforms + canonical (rare); below: canonical outperforms own (typical). +""" +from __future__ import annotations +import json +import math +import statistics +from pathlib import Path +from collections import defaultdict + +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt +import numpy as np + +ROOT = Path("/home/yurenh2/gap/analysis") +FIG_DIR = ROOT / "figures" +FIG_DIR.mkdir(parents=True, exist_ok=True) + +VARIANT_LABELS = { + "descriptive_long": "DL", + "descriptive_long_confusing": "DLC", + "descriptive_long_misleading": "DLM", + "garbled_string": "GS", + "kernel_variant": "KV", +} +VARIANT_ORDER_SURF = ["descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string"] +VARIANT_ORDER_ALL = VARIANT_ORDER_SURF + ["kernel_variant"] + +# ---------------------------------------------------------------------- +# Fig 1 — Structural Cohen's d heatmap +# ---------------------------------------------------------------------- + +def fig1_structural_d_heatmap(): + """Heatmap of Cohen's d for the stable-vs-brittle structural metric. + + Surface cells: self-anchor (token Jaccard between model's variant + trajectory and its own original-correct trajectory after canonicalization). + Source file: structural_overlap_results.json. + + KV cells: canonical-anchor (token Jaccard between model's KV trajectory and + the dataset's canonical KV solution). + Source file: kv_overlap_results.json. 
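The two result files are merged into a single (model, variant) -> d lookup
before the matrix is built; a toy sketch with invented d values in the same
record shape as the JSON files:

```python
# Toy records shaped like structural_overlap_results.json / kv_overlap_results.json
# (d values invented for illustration).
surf = [{"model": "m1", "variant": "descriptive_long",
         "metrics": {"token_jaccard": {"cohens_d": 0.8}}}]
kv = [{"model": "m1",
       "metrics": {"token_jaccard": {"cohens_d": 0.3}}}]

by_cell = {}
for c in surf:
    by_cell[(c["model"], c["variant"])] = c["metrics"]["token_jaccard"]["cohens_d"]
for c in kv:
    # KV records carry no variant field; key them explicitly.
    by_cell[(c["model"], "kernel_variant")] = c["metrics"]["token_jaccard"]["cohens_d"]
print(by_cell)
```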
+ """ + surf = json.load(open(ROOT / "structural_overlap_results.json")) + kv = json.load(open(ROOT / "kv_overlap_results.json")) + + # Build matrix: rows = models (sorted by mean d), cols = variants (DL, DLC, DLM, GS, KV) + by_cell = {} + for c in surf: + by_cell[(c["model"], c["variant"])] = c["metrics"]["token_jaccard"]["cohens_d"] + for c in kv: + by_cell[(c["model"], "kernel_variant")] = c["metrics"]["token_jaccard"]["cohens_d"] + + models = sorted({k[0] for k in by_cell}) + # Sort by mean d across surface variants only (so KV doesn't bias the order) + def mean_surface_d(m): + ds = [by_cell.get((m, v)) for v in VARIANT_ORDER_SURF + if by_cell.get((m, v)) is not None] + return statistics.fmean(ds) if ds else 0.0 + models.sort(key=mean_surface_d, reverse=True) + + M = np.full((len(models), len(VARIANT_ORDER_ALL)), np.nan) + for i, m in enumerate(models): + for j, v in enumerate(VARIANT_ORDER_ALL): + d = by_cell.get((m, v)) + if d is not None: + M[i, j] = d + + fig, ax = plt.subplots(figsize=(7, 9)) + vmin = 0.0 + vmax = 1.4 + cmap = plt.cm.viridis + im = ax.imshow(M, cmap=cmap, vmin=vmin, vmax=vmax, aspect="auto") + ax.set_xticks(range(len(VARIANT_ORDER_ALL))) + ax.set_xticklabels([VARIANT_LABELS[v] for v in VARIANT_ORDER_ALL]) + ax.set_yticks(range(len(models))) + ax.set_yticklabels(models, fontsize=9) + # Annotate values + for i in range(len(models)): + for j in range(len(VARIANT_ORDER_ALL)): + v = M[i, j] + if not math.isnan(v): + color = "white" if v < 0.7 else "black" + ax.text(j, i, f"{v:+.2f}", ha="center", va="center", + fontsize=8, color=color) + # Vertical line separating surface from KV + ax.axvline(x=3.5, color="white", linewidth=2) + cbar = plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04) + cbar.set_label("Cohen's d (stable − brittle)\non canonicalized token Jaccard", + fontsize=9) + ax.set_title("Structural overlap effect size: stable vs brittle\n" + "(surface = self-anchor; KV = canonical-anchor)", + fontsize=11) + ax.set_xlabel("Variant family", 
fontsize=10) + plt.tight_layout() + out = FIG_DIR / "fig1_structural_d_heatmap.png" + plt.savefig(out, dpi=200, bbox_inches="tight") + plt.close() + print(f"Saved {out}") + + +# ---------------------------------------------------------------------- +# Fig 2 — Rescue rebound rates with Wilson CI +# ---------------------------------------------------------------------- + +def wilson_ci(k: int, n: int, z: float = 1.96): + if n == 0: + return (0.0, 0.0, 0.0) + p = k / n + denom = 1 + z * z / n + center = (p + z * z / (2 * n)) / denom + half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom + return (p, max(0.0, center - half), min(1.0, center + half)) + + +def fig2_rescue_rates(): + rows = [json.loads(l) for l in open(ROOT / "rescue_results/rescue_30.jsonl")] + + counts = defaultdict(lambda: {"k": 0, "n": 0}) + for r in rows: + counts[(r["variant"], r["condition"])]["n"] += 1 + if r.get("grade") == "CORRECT": + counts[(r["variant"], r["condition"])]["k"] += 1 + + conds_full = ["null", "canonical_T2", "own_T2"] + cond_color = {"null": "#888888", "canonical_T2": "#1f77b4", "own_T2": "#d62728"} + cond_label = {"null": "null (generic scaffold)", + "canonical_T2": "canonical_T2 (item-specific, expert prose)", + "own_T2": "own_T2 (item-specific, model's own work, renamed)"} + + fig, ax = plt.subplots(figsize=(8, 5)) + n_var = len(VARIANT_ORDER_ALL) + width = 0.27 + x = np.arange(n_var) + for ci, cond in enumerate(conds_full): + ks, lows, highs, ps = [], [], [], [] + for v in VARIANT_ORDER_ALL: + d = counts.get((v, cond)) + if d is None: + ks.append(0); lows.append(0); highs.append(0); ps.append(0) + continue + p, lo, hi = wilson_ci(d["k"], d["n"]) + ps.append(p * 100) + lows.append((p - lo) * 100) + highs.append((hi - p) * 100) + ks.append(d["k"]) + offset = (ci - 1) * width + ax.bar(x + offset, ps, width=width, color=cond_color[cond], label=cond_label[cond], + yerr=[lows, highs], capsize=3, error_kw={"elinewidth": 1, "ecolor": "#444444"}) + # Annotate counts 
above each bar + for xi, p, k in zip(x + offset, ps, ks): + if k > 0: + ax.text(xi, p + 0.5, f"{p:.0f}%", ha="center", va="bottom", fontsize=8) + + ax.set_xticks(x) + ax.set_xticklabels([VARIANT_LABELS[v] for v in VARIANT_ORDER_ALL], fontsize=10) + ax.set_ylabel("Rebound rate (%) on flip cases", fontsize=10) + ax.set_title("Repairability rescue: rebound rate by variant and prefix condition\n" + "(pooled across 4 models, n ≈ 100–120 per cell, 95% Wilson CI)", + fontsize=11) + ax.set_ylim(0, 60) + ax.legend(loc="upper right", fontsize=8, framealpha=0.95) + ax.grid(axis="y", linestyle="--", alpha=0.4) + ax.set_axisbelow(True) + plt.tight_layout() + out = FIG_DIR / "fig2_rescue_rebound.png" + plt.savefig(out, dpi=200, bbox_inches="tight") + plt.close() + print(f"Saved {out}") + + +# ---------------------------------------------------------------------- +# Fig 3 — own_T2 vs canonical_T2 scatter +# ---------------------------------------------------------------------- + +def fig3_own_vs_canonical_scatter(): + rows = [json.loads(l) for l in open(ROOT / "rescue_results/rescue_30.jsonl")] + + counts = defaultdict(lambda: {"k": 0, "n": 0}) + for r in rows: + counts[(r["model"], r["variant"], r["condition"])]["n"] += 1 + if r.get("grade") == "CORRECT": + counts[(r["model"], r["variant"], r["condition"])]["k"] += 1 + + fig, ax = plt.subplots(figsize=(7, 7)) + + models_in_data = sorted({k[0] for k in counts}) + model_color = { + "claude-sonnet-4": "#ff7f0e", + "gemini-2.5-flash": "#2ca02c", + "gpt-4.1-mini": "#1f77b4", + "gpt-4o-mini": "#d62728", + } + var_marker = { + "descriptive_long": "o", + "descriptive_long_confusing": "s", + "descriptive_long_misleading": "^", + "garbled_string": "D", + } + + # Diagonal + ax.plot([0, 0.7], [0, 0.7], "k--", lw=1, alpha=0.5) + ax.text(0.62, 0.66, "y = x", fontsize=8, alpha=0.6) + + for m in models_in_data: + for v in VARIANT_ORDER_SURF: + own = counts.get((m, v, "own_T2")) + can = counts.get((m, v, "canonical_T2")) + if own is None or can 
is None or own["n"] == 0 or can["n"] == 0: + continue + x = can["k"] / can["n"] + y = own["k"] / own["n"] + ax.scatter(x, y, s=110, c=model_color.get(m, "gray"), + marker=var_marker[v], alpha=0.85, + edgecolors="black", linewidths=0.6) + + # Build legend + from matplotlib.lines import Line2D + model_handles = [Line2D([], [], marker="o", linestyle="", markersize=9, + markerfacecolor=c, markeredgecolor="black", + markeredgewidth=0.6, label=m) + for m, c in model_color.items() if m in models_in_data] + variant_handles = [Line2D([], [], marker=mk, linestyle="", markersize=9, + markerfacecolor="lightgray", markeredgecolor="black", + markeredgewidth=0.6, label=VARIANT_LABELS[v]) + for v, mk in var_marker.items()] + leg1 = ax.legend(handles=model_handles, loc="upper left", title="Model", + fontsize=8, title_fontsize=9, framealpha=0.95) + ax.add_artist(leg1) + ax.legend(handles=variant_handles, loc="lower right", title="Variant", + fontsize=8, title_fontsize=9, framealpha=0.95) + + ax.set_xlim(0, 0.7) + ax.set_ylim(0, 0.7) + ax.set_xlabel("canonical_T2 rebound rate", fontsize=10) + ax.set_ylabel("own_T2 rebound rate", fontsize=10) + ax.set_title("Per-cell rescue rates: model's own prefix vs canonical prefix\n" + "(below diagonal = canonical wins; gpt-4o-mini is the only family above)", + fontsize=11) + ax.grid(linestyle="--", alpha=0.4) + ax.set_axisbelow(True) + plt.tight_layout() + out = FIG_DIR / "fig3_own_vs_canonical_scatter.png" + plt.savefig(out, dpi=200, bbox_inches="tight") + plt.close() + print(f"Saved {out}") + + +def main(): + fig1_structural_d_heatmap() + fig2_rescue_rates() + fig3_own_vs_canonical_scatter() + print("\nAll figures written to:", FIG_DIR) + + +if __name__ == "__main__": + main() diff --git a/analysis/normalization_analysis.py b/analysis/normalization_analysis.py new file mode 100644 index 0000000..8fb4f48 --- /dev/null +++ b/analysis/normalization_analysis.py @@ -0,0 +1,189 @@ +"""Quantify spontaneous variant->canonical name normalization in 
own_T2 outputs. + +For each own_T2 case, check whether the model's student_solution preserves the +variant variable names from its prefix or normalizes them back to the canonical +names from the dataset's rename map. + +For each variant variable name in the rename map: +- count its occurrences in the prefix (as injected) +- count its occurrences in the model's student_solution +- count occurrences of the corresponding CANONICAL name in the student_solution + +If the model preserves variant naming: variant_name count in solution should be +proportionally similar to the prefix count. +If the model normalizes back: canonical_name count in solution should rise while +variant_name count drops. +""" +from __future__ import annotations +import json +import re +import sys +from pathlib import Path +from collections import defaultdict +import statistics + +THIS_DIR = Path(__file__).resolve().parent +sys.path.insert(0, str(THIS_DIR)) +from rescue_runner import load_dataset_full, find_flip_cases, build_case_prompts + +PILOT_PATH = Path("/home/yurenh2/gap/analysis/rescue_results/rescue_5.jsonl") + + +def count_word(text: str, word: str) -> int: + """Whole-word count of `word` in `text`.""" + if not text or not word: + return 0 + pat = r"(?<![A-Za-z0-9_])" + re.escape(word) + r"(?![A-Za-z0-9_])" + return len(re.findall(pat, text)) + + +def analyze_one(row: dict, ds_cell: dict) -> dict: + """For one own_T2 row, compute name preservation stats.""" + variant = row["variant"] + var_info = ds_cell["variants"].get(variant, {}) + rmap = var_info.get("map") or {} + if not rmap: + return {} + student = row.get("student_solution") or "" + if not student: + return {} + + # Build the prefix that the model was given + case = { + "index": row["index"], + "problem_type": row["problem_type"], + "orig_solution": "", + "orig_final_answer": "", + } + # We need the original solution from results_new to reconstruct the prefix. + # Use find_flip_cases to recover it cleanly. 
+ cases = find_flip_cases(row["model"], variant, 100) + matched = next((c for c in cases if c["index"] == row["index"]), None) + if matched is None: + return {} + prompts = build_case_prompts(matched, variant, ds_cell) + own_prompt = prompts.get("own_T2", "") + if "PARTIAL WORK" not in own_prompt: + return {} + # Extract just the partial work text + section = own_prompt.split("PARTIAL WORK")[1].split("Provide a complete")[0] + section = section.split("(to copy verbatim")[1] if "(to copy verbatim" in section else section + section = section.split("):", 1)[1] if "):" in section else section + prefix = section.strip() + + # For each variant variable, count occurrences in prefix and in student + per_var = {} + for canon_name, var_name in rmap.items(): + if not var_name: + continue + prefix_v = count_word(prefix, var_name) + student_v = count_word(student, var_name) + student_c = count_word(student, canon_name) + # Only meaningful if the variant name actually appeared in the prefix + if prefix_v == 0: + continue + per_var[var_name] = { + "canon_name": canon_name, + "prefix_count_variant": prefix_v, + "student_count_variant": student_v, + "student_count_canonical": student_c, + # Preservation ratio: how much of the variant naming survived + # capped to 1.0 (model may use the variable many more times in + # its continuation, which inflates the count) + "preservation_ratio": min(1.0, student_v / max(1, prefix_v)), + "normalization_ratio": min(1.0, student_c / max(1, prefix_v)), + } + if not per_var: + return {} + # Aggregate per case: median preservation + pres_vals = [v["preservation_ratio"] for v in per_var.values()] + norm_vals = [v["normalization_ratio"] for v in per_var.values()] + return { + "model": row["model"], + "variant": variant, + "index": row["index"], + "grade": row.get("grade"), + "n_vars_in_prefix": len(per_var), + "median_preservation": statistics.median(pres_vals), + "median_normalization": statistics.median(norm_vals), + "mean_preservation": 
statistics.fmean(pres_vals), + "mean_normalization": statistics.fmean(norm_vals), + "per_var": per_var, + } + + +def main(): + print("Loading dataset ...") + ds = load_dataset_full() + print(f"Loaded {len(ds)} problems") + print(f"\nLoading pilot rows from {PILOT_PATH} ...") + rows = [json.loads(l) for l in open(PILOT_PATH)] + own_rows = [r for r in rows if r["condition"] == "own_T2"] + print(f" total rows: {len(rows)}, own_T2 rows: {len(own_rows)}") + + analyses = [] + skipped = 0 + for r in own_rows: + ds_cell = ds.get(r["index"]) + if ds_cell is None: + skipped += 1 + continue + a = analyze_one(r, ds_cell) + if a: + analyses.append(a) + else: + skipped += 1 + print(f" analyzed: {len(analyses)}, skipped: {skipped}") + + # Aggregate by variant + print("\n=== SPONTANEOUS NORMALIZATION (own_T2 condition only) ===\n") + print("Per case: median across variant variables of preservation ratio") + print("(higher = more variant naming preserved; lower = normalized back to canonical)") + print() + print(f"{'Variant':<32} {'n':>4} {'median_pres':>12} {'mean_pres':>10} " + f"{'median_norm':>12} {'mean_norm':>10}") + print("-" * 90) + by_variant = defaultdict(list) + for a in analyses: + by_variant[a["variant"]].append(a) + for v in sorted(by_variant): + cs = by_variant[v] + mp_vals = [c["median_preservation"] for c in cs] + mn_vals = [c["median_normalization"] for c in cs] + print(f"{v:<32} {len(cs):>4} " + f"{statistics.median(mp_vals):>12.3f} {statistics.fmean(mp_vals):>10.3f} " + f"{statistics.median(mn_vals):>12.3f} {statistics.fmean(mn_vals):>10.3f}") + + # Aggregate by model + print(f"\n{'Model':<22} {'n':>4} {'median_pres':>12} {'mean_pres':>10} " + f"{'median_norm':>12} {'mean_norm':>10}") + print("-" * 80) + by_model = defaultdict(list) + for a in analyses: + by_model[a["model"]].append(a) + for m in sorted(by_model): + cs = by_model[m] + mp_vals = [c["median_preservation"] for c in cs] + mn_vals = [c["median_normalization"] for c in cs] + print(f"{m:<22} 
{len(cs):>4} " + f"{statistics.median(mp_vals):>12.3f} {statistics.fmean(mp_vals):>10.3f} " + f"{statistics.median(mn_vals):>12.3f} {statistics.fmean(mn_vals):>10.3f}") + + # Effect of normalization on rebound: do cases that normalized more often FAIL? + print("\n=== RELATION TO REBOUND ===") + pass_pres = [a["median_preservation"] for a in analyses if a["grade"] == "CORRECT"] + fail_pres = [a["median_preservation"] for a in analyses if a["grade"] == "INCORRECT"] + print(f" median_preservation among rebound CORRECT (n={len(pass_pres)}): " + f"median={statistics.median(pass_pres):.3f} mean={statistics.fmean(pass_pres):.3f}") + print(f" median_preservation among rebound INCORRECT (n={len(fail_pres)}): " + f"median={statistics.median(fail_pres):.3f} mean={statistics.fmean(fail_pres):.3f}") + + # Save detailed results + out = Path("/home/yurenh2/gap/analysis/normalization_results.json") + json.dump([{k: v for k, v in a.items() if k != "per_var"} for a in analyses], + open(out, "w"), indent=2) + print(f"\nSaved -> {out}") + + +if __name__ == "__main__": + main() diff --git a/analysis/rescue_analyze.py b/analysis/rescue_analyze.py new file mode 100644 index 0000000..5fe97b6 --- /dev/null +++ b/analysis/rescue_analyze.py @@ -0,0 +1,161 @@ +"""Analyze full rescue results: per-cell rebound rates, Wilson CIs, McNemar.""" +from __future__ import annotations +import json +import math +import statistics +from collections import defaultdict +from pathlib import Path + +PATH = Path("/home/yurenh2/gap/analysis/rescue_results/rescue_30.jsonl") + + +def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple: + if n == 0: + return (0.0, 0.0, 0.0) + p = k / n + denom = 1 + z * z / n + center = (p + z * z / (2 * n)) / denom + half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom + return (p, max(0.0, center - half), min(1.0, center + half)) + + +def mcnemar_p(b: int, c: int) -> float: + """McNemar exact-ish p (binomial two-sided). 
b = treat A correct, B wrong;
+    c = treat A wrong, B correct. Returns p value testing b == c."""
+    n = b + c
+    if n == 0:
+        return 1.0
+    # Two-sided binomial test on min(b,c) ~ Bin(n, 0.5)
+    k = min(b, c)
+    # cumulative
+    cum = 0.0
+    for i in range(k + 1):
+        cum += math.comb(n, i) * (0.5 ** n)
+    p = min(1.0, 2 * cum)
+    return p
+
+
+def main():
+    rows = [json.loads(l) for l in open(PATH)]
+    print(f"Loaded {len(rows)} rows")
+
+    # Quick sanity
+    from collections import Counter
+    print("Solve status:", Counter(r.get("solve_status") for r in rows))
+    print("Grade status:", Counter(r.get("grade_status") for r in rows))
+
+    # Per-cell counts. Rows whose grade is not CORRECT (including solve and
+    # parse failures) count toward the denominator only, which treats them
+    # as INCORRECT (the conservative choice).
+    counts = defaultdict(lambda: {"total": 0, "correct": 0})
+    for r in rows:
+        key = (r["model"], r["variant"], r["condition"])
+        counts[key]["total"] += 1
+        if r.get("grade") == "CORRECT":
+            counts[key]["correct"] += 1
+
+    # Aggregated by (variant, condition)
+    by_var_cond = defaultdict(lambda: {"total": 0, "correct": 0})
+    for (m, v, c), d in counts.items():
+        by_var_cond[(v, c)]["total"] += d["total"]
+        by_var_cond[(v, c)]["correct"] += d["correct"]
+
+    print("\n" + "=" * 90)
+    print("REBOUND RATE BY (VARIANT, CONDITION) [aggregated across 4 models]")
+    print("=" * 90)
+    print(f"{'Variant':<32} {'Condition':<14} {'k/n':>10} {'rate':>7} {'95% Wilson CI':>20}")
+    print("-" * 90)
+    variants_order = ["descriptive_long", "descriptive_long_confusing",
+                      "descriptive_long_misleading", "garbled_string", "kernel_variant"]
+    conds_order = ["null", "canonical_T2", "own_T2"]
+    for v in variants_order:
+        for c in conds_order:
+            d = by_var_cond.get((v, c))
+            if not d:
+                continue
+            p, lo, hi = wilson_ci(d["correct"], d["total"])
+            print(f"{v:<32} {c:<14} {d['correct']:>4}/{d['total']:>4} "
+                  f"{p*100:>5.1f}% [{lo*100:>5.1f}%, {hi*100:>5.1f}%]")
+        print()
+
+    # Per-model
aggregated by (variant, condition) + print("\n" + "=" * 90) + print("REBOUND RATE PER (MODEL, VARIANT, CONDITION)") + print("=" * 90) + models_order = sorted({k[0] for k in counts}) + print(f"{'Model':<22} {'Variant':<32} {'cond':<14} {'k/n':>10} {'rate':>7}") + for m in models_order: + for v in variants_order: + for c in conds_order: + d = counts.get((m, v, c)) + if not d: + continue + p, lo, hi = wilson_ci(d["correct"], d["total"]) + print(f" {m:<20} {v:<32} {c:<14} {d['correct']:>3}/{d['total']:>3} " + f"{p*100:>5.1f}%") + print() + + # Paired McNemar test: same case, different conditions + # Pair canonical_T2 vs null, and own_T2 vs null + print("\n" + "=" * 90) + print("PAIRED MCNEMAR TESTS") + print("=" * 90) + case_grades = defaultdict(dict) # (model, variant, index) -> {cond: grade} + for r in rows: + case_grades[(r["model"], r["variant"], r["index"])][r["condition"]] = r.get("grade") + + print("\ncanonical_T2 vs null:") + print(f" {'cell':<46} {'b (can-only)':>12} {'c (null-only)':>13} " + f"{'both-CORR':>10} {'both-INC':>10} {'McNemar p':>11}") + for m in models_order: + for v in variants_order: + b = c = both_corr = both_inc = 0 + for k, grds in case_grades.items(): + if k[0] != m or k[1] != v: continue + ca = grds.get("canonical_T2"); nu = grds.get("null") + if ca is None or nu is None: continue + if ca == "CORRECT" and nu == "INCORRECT": b += 1 + elif ca == "INCORRECT" and nu == "CORRECT": c += 1 + elif ca == "CORRECT" and nu == "CORRECT": both_corr += 1 + elif ca == "INCORRECT" and nu == "INCORRECT": both_inc += 1 + p = mcnemar_p(b, c) + print(f" {m+'/'+v:<46} {b:>12} {c:>13} {both_corr:>10} {both_inc:>10} {p:>11.3f}") + + print("\nown_T2 vs null:") + print(f" {'cell':<46} {'b (own-only)':>12} {'c (null-only)':>13} " + f"{'both-CORR':>10} {'both-INC':>10} {'McNemar p':>11}") + for m in models_order: + for v in [vv for vv in variants_order if vv != "kernel_variant"]: + b = c = both_corr = both_inc = 0 + for k, grds in case_grades.items(): + if k[0] != m 
or k[1] != v: continue + ow = grds.get("own_T2"); nu = grds.get("null") + if ow is None or nu is None: continue + if ow == "CORRECT" and nu == "INCORRECT": b += 1 + elif ow == "INCORRECT" and nu == "CORRECT": c += 1 + elif ow == "CORRECT" and nu == "CORRECT": both_corr += 1 + elif ow == "INCORRECT" and nu == "INCORRECT": both_inc += 1 + p = mcnemar_p(b, c) + print(f" {m+'/'+v:<46} {b:>12} {c:>13} {both_corr:>10} {both_inc:>10} {p:>11.3f}") + + print("\nown_T2 vs canonical_T2:") + print(f" {'cell':<46} {'b (own-only)':>12} {'c (can-only)':>13} " + f"{'both-CORR':>10} {'both-INC':>10} {'McNemar p':>11}") + for m in models_order: + for v in [vv for vv in variants_order if vv != "kernel_variant"]: + b = c = both_corr = both_inc = 0 + for k, grds in case_grades.items(): + if k[0] != m or k[1] != v: continue + ow = grds.get("own_T2"); ca = grds.get("canonical_T2") + if ow is None or ca is None: continue + if ow == "CORRECT" and ca == "INCORRECT": b += 1 + elif ow == "INCORRECT" and ca == "CORRECT": c += 1 + elif ow == "CORRECT" and ca == "CORRECT": both_corr += 1 + elif ow == "INCORRECT" and ca == "INCORRECT": both_inc += 1 + p = mcnemar_p(b, c) + print(f" {m+'/'+v:<46} {b:>12} {c:>13} {both_corr:>10} {both_inc:>10} {p:>11.3f}") + + +if __name__ == "__main__": + main() diff --git a/analysis/rescue_api.py b/analysis/rescue_api.py new file mode 100644 index 0000000..4641655 --- /dev/null +++ b/analysis/rescue_api.py @@ -0,0 +1,373 @@ +"""Async API caller for rescue experiment. + +Supports OpenAI, Anthropic, Google. All callers return a unified dict: + {"status": "success"|"failed", "content": str, "error": str|None} + +Concurrency is controlled per-provider via asyncio.Semaphore so we don't +saturate rate limits in any one provider. 
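The semaphore throttle can be sketched without touching any provider. This is an illustrative toy, not the production path: the bound of 2 and the fake call below are invented, while the real per-provider limits live in PER_PROVIDER_CONCURRENCY.

```python
import asyncio

peak = active = 0

async def fake_call(sem: asyncio.Semaphore) -> dict:
    """Stand-in for one provider call; no network involved."""
    global peak, active
    async with sem:                  # at most `bound` calls in flight
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)    # pretend round trip
        active -= 1
    return {"status": "success", "content": "ok", "error": None}

async def demo(bound: int = 2, n_calls: int = 10):
    sem = asyncio.Semaphore(bound)   # one semaphore per provider, as above
    return await asyncio.gather(*(fake_call(sem) for _ in range(n_calls)))

results = asyncio.run(demo())
print(peak)  # 2: in-flight calls never exceed the bound
```

Every result uses the same unified dict shape, so callers branch on `r["status"]` instead of provider-specific exceptions.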
+""" +from __future__ import annotations +import asyncio +import json +import os +import random +from typing import Optional + +# ---------- Provider constants ---------- + +# Solver model -> provider mapping +SOLVER_PROVIDERS = { + "gpt-4.1-mini": "openai", + "gpt-4o-mini": "openai", + "claude-sonnet-4": "anthropic", + "gemini-2.5-flash": "google", +} + +# API model strings (the canonical IDs to send) +API_MODEL_NAMES = { + "gpt-4.1-mini": "gpt-4.1-mini", + "gpt-4o-mini": "gpt-4o-mini", + "claude-sonnet-4": "claude-sonnet-4-20250514", + "gemini-2.5-flash": "gemini-2.5-flash", +} + +GRADER_MODEL = "gpt-4o" +GRADER_PROVIDER = "openai" + +PER_PROVIDER_CONCURRENCY = { + "openai": 500, + "anthropic": 25, # 90k tok/min cap; 25 in flight keeps us comfortably under + "google": 300, +} + +DEFAULT_RETRIES = 6 +DEFAULT_BASE_TIMEOUT = 300.0 +RATE_LIMIT_BACKOFF_SECONDS = 60.0 # min sleep on rate limit hits + + +# ---------- Solver / grader prompts (consistent with paper) ---------- + +SOLVER_SYSTEM_PROMPT = """You are an expert mathematician solving competition-level problems. +Provide detailed, step-by-step solutions with clear mathematical reasoning. + +Requirements: +- Show all your work and intermediate steps +- Justify each major step of your reasoning +- Use proper mathematical notation +- Be thorough but concise +- State your final answer clearly + +Solve the problem completely and rigorously.""" + +PROOF_GRADER_SYSTEM_PROMPT = """You are an extremely strict mathematical grader evaluating competition-level PROOF problems. 
+ +GRADING STANDARDS (BE VERY STRICT): +- Mathematical rigor: Every step must be mathematically sound and justified +- Logical flow: The reasoning must be clear, complete, and logically connected +- Correctness: All calculations, algebraic manipulations, and conclusions must be correct +- Completeness: The solution must address all parts of the problem fully +- Precision: Mathematical statements must be precise and unambiguous + +FAILING CRITERIA (Mark as INCORRECT if ANY of these apply): +- Any unjustified logical leap or gap in reasoning +- Any computational error, no matter how small +- Missing steps in critical parts of the argument +- Imprecise or ambiguous mathematical statements +- Incorrect final answer, even if approach is partially correct +- Circular reasoning or logical fallacies +- Misuse of mathematical theorems or definitions + +BE EXTREMELY STRICT. Competition mathematics proofs require perfect precision.""" + +CALCULATION_GRADER_SYSTEM_PROMPT = """You are a mathematical grader evaluating competition-level CALCULATION problems. + +GRADING STANDARDS FOR CALCULATION PROBLEMS: +- Primary focus: Is the final answer correct? +- Secondary focus: Is the overall approach reasonable and mathematically sound? +- Computation: Allow minor computational slips if the method is correct and final answer is right + +GRADING CRITERIA: +- CORRECT: Final answer is correct AND approach is fundamentally sound +- INCORRECT: Final answer is wrong OR approach is fundamentally flawed + +For calculation problems, the final numerical answer is the most important criterion. +Minor intermediate errors are acceptable if they don't affect the final result.""" + +PROOF_GRADER_USER_TEMPLATE = """Grade this PROOF solution with extreme strictness. + +PROBLEM: +{problem_statement} + +STUDENT SOLUTION: +{solution} + +CORRECT REFERENCE SOLUTION: +{reference_solution} + +Evaluate with maximum strictness. Every logical step must be perfect. 
Return JSON with: +{{"grade": "CORRECT" or "INCORRECT", + "detailed_feedback": "specific detailed analysis of what is right/wrong", + "major_issues": "list of significant mathematical errors or gaps", + "final_answer_correct": true or false, + "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed), + "overall_assessment": "comprehensive evaluation summary"}}""" + +CALCULATION_GRADER_USER_TEMPLATE = """Grade this CALCULATION solution with focus on final answer correctness. + +PROBLEM: +{problem_statement} + +STUDENT SOLUTION: +{solution} + +CORRECT REFERENCE SOLUTION: +{reference_solution} + +Focus primarily on whether the final answer is correct. Return JSON with: +{{"grade": "CORRECT" or "INCORRECT", + "detailed_feedback": "specific detailed analysis of what is right/wrong", + "major_issues": "list of significant mathematical errors or gaps", + "final_answer_correct": true or false, + "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed), + "overall_assessment": "comprehensive evaluation summary"}}""" + + +# ---------- Lazy client builders ---------- + +_openai_client = None +_anthropic_client = None +_google_client = None + +def _get_openai_client(): + global _openai_client + if _openai_client is None: + from openai import AsyncOpenAI + import httpx + limits = httpx.Limits(max_connections=2000, max_keepalive_connections=1000) + timeout = httpx.Timeout(timeout=DEFAULT_BASE_TIMEOUT, connect=30.0, + read=DEFAULT_BASE_TIMEOUT, write=30.0) + _openai_client = AsyncOpenAI(http_client=httpx.AsyncClient(limits=limits, timeout=timeout)) + return _openai_client + + +def _get_anthropic_client(): + global _anthropic_client + if _anthropic_client is None: + from anthropic import AsyncAnthropic + _anthropic_client = AsyncAnthropic() + return _anthropic_client + + +def _get_google_client(): + global _google_client + if _google_client is None: + from google import genai + _google_client = 
genai.Client(api_key=os.environ["GOOGLE_API_KEY"]) + return _google_client + + +# ---------- Per-provider call functions ---------- + +async def _call_openai(model: str, system: str, user: str, + temperature: float, max_tokens: int = 16000) -> dict: + client = _get_openai_client() + api_params = { + "model": model, + "messages": [ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + "max_tokens": max_tokens, + } + # o-series models force temperature=1 and don't accept max_tokens + if any(p in model.lower() for p in ["o1", "o3", "o4"]): + api_params.pop("max_tokens", None) + api_params["temperature"] = 1.0 + else: + api_params["temperature"] = temperature + api_params["response_format"] = {"type": "json_object"} + resp = await client.chat.completions.create(**api_params) + content = resp.choices[0].message.content or "" + return {"status": "success", "content": content, "error": None} + + +async def _call_anthropic(model: str, system: str, user: str, + temperature: float, max_tokens: int = 16000) -> dict: + client = _get_anthropic_client() + resp = await client.messages.create( + model=model, + system=system, + messages=[{"role": "user", "content": user}], + temperature=temperature, + max_tokens=max_tokens, + ) + content = "" + if resp.content: + for block in resp.content: + if hasattr(block, "text"): + content += block.text + return {"status": "success", "content": content, "error": None} + + +async def _call_google(model: str, system: str, user: str, + temperature: float, max_tokens: int = 16000) -> dict: + client = _get_google_client() + from google.genai.types import GenerateContentConfig + config = GenerateContentConfig( + system_instruction=system, + temperature=temperature, + max_output_tokens=max_tokens, + response_mime_type="application/json", + ) + resp = await client.aio.models.generate_content( + model=model, contents=user, config=config, + ) + content = resp.text or "" + return {"status": "success", "content": content, 
"error": None} + + +# ---------- Unified caller with retries and per-provider semaphore ---------- + +_provider_sems: dict = {} + +def _sem_for(provider: str) -> asyncio.Semaphore: + if provider not in _provider_sems: + _provider_sems[provider] = asyncio.Semaphore(PER_PROVIDER_CONCURRENCY[provider]) + return _provider_sems[provider] + + +async def call_model(model_short: str, system: str, user: str, + temperature: float = 0.0, max_tokens: int = 16000, + retries: int = DEFAULT_RETRIES) -> dict: + """Call any supported model by short alias. Includes retries.""" + if model_short == GRADER_MODEL: + provider = GRADER_PROVIDER + api_model = GRADER_MODEL + else: + provider = SOLVER_PROVIDERS[model_short] + api_model = API_MODEL_NAMES[model_short] + sem = _sem_for(provider) + + async with sem: + last_err = None + for attempt in range(retries): + try: + if provider == "openai": + return await _call_openai(api_model, system, user, temperature, max_tokens) + elif provider == "anthropic": + return await _call_anthropic(api_model, system, user, temperature, max_tokens) + elif provider == "google": + return await _call_google(api_model, system, user, temperature, max_tokens) + else: + return {"status": "failed", "content": "", + "error": f"unknown provider {provider}"} + except Exception as e: + last_err = e + err_str = str(e).lower() + # Longer backoff for rate-limit-style errors so the per-minute + # window has time to refill. + if "rate_limit" in err_str or "429" in err_str or "rate limit" in err_str: + await asyncio.sleep(RATE_LIMIT_BACKOFF_SECONDS + random.random() * 10) + else: + await asyncio.sleep(min(2 ** attempt + random.random(), 30)) + return {"status": "failed", "content": "", + "error": f"{type(last_err).__name__}: {str(last_err)[:300]}"} + + +# ---------- High-level helpers ---------- + +async def solve(model_short: str, problem_user_msg: str) -> dict: + """Run the solver. 
The user message already contains problem + any prefix.""" + return await call_model(model_short, SOLVER_SYSTEM_PROMPT, problem_user_msg, temperature=0.0) + + +async def grade(problem_type: str, problem_statement: str, + solution: str, reference_solution: str) -> dict: + """Run the grader (gpt-4o).""" + if problem_type == "proof": + sys = PROOF_GRADER_SYSTEM_PROMPT + tmpl = PROOF_GRADER_USER_TEMPLATE + else: + sys = CALCULATION_GRADER_SYSTEM_PROMPT + tmpl = CALCULATION_GRADER_USER_TEMPLATE + user = tmpl.format(problem_statement=problem_statement, + solution=solution, + reference_solution=reference_solution) + return await call_model(GRADER_MODEL, sys, user, temperature=0.0) + + +def parse_solution(content: str) -> dict: + """Parse JSON {solution, final_answer} from model output, with tolerance.""" + if not content: + return {"solution": "", "final_answer": "", "_parse_error": "empty"} + try: + d = json.loads(content) + return {"solution": d.get("solution", ""), + "final_answer": d.get("final_answer", ""), + "_parse_error": None} + except Exception: + # Try to extract a JSON object substring + import re + m = re.search(r"\{.*\}", content, re.DOTALL) + if m: + try: + d = json.loads(m.group(0)) + return {"solution": d.get("solution", ""), + "final_answer": d.get("final_answer", ""), + "_parse_error": None} + except Exception as e: + return {"solution": content, "final_answer": "", + "_parse_error": f"json parse: {e}"} + return {"solution": content, "final_answer": "", + "_parse_error": "no JSON object found"} + + +def parse_grade(content: str) -> dict: + """Parse JSON grade output.""" + if not content: + return {"grade": "INCORRECT", "_parse_error": "empty"} + try: + d = json.loads(content) + # Normalize grade + g = (d.get("grade") or "").strip().upper() + return { + "grade": g if g in ("CORRECT", "INCORRECT") else "INCORRECT", + "final_answer_correct": d.get("final_answer_correct"), + "detailed_feedback": d.get("detailed_feedback", ""), + "_parse_error": None, + } + 
except Exception: + import re + m = re.search(r"\{.*\}", content, re.DOTALL) + if m: + try: + d = json.loads(m.group(0)) + g = (d.get("grade") or "").strip().upper() + return { + "grade": g if g in ("CORRECT", "INCORRECT") else "INCORRECT", + "final_answer_correct": d.get("final_answer_correct"), + "detailed_feedback": d.get("detailed_feedback", ""), + "_parse_error": None, + } + except Exception as e: + return {"grade": "INCORRECT", "_parse_error": f"json parse: {e}"} + return {"grade": "INCORRECT", "_parse_error": "no JSON object found"} + + +# ---------- Standalone health check ---------- + +async def _health_check(): + print("Running health checks ...") + msg = ('Reply with JSON {"status": "ok"} only.') + for short in ["gpt-4o-mini", "claude-sonnet-4", "gemini-2.5-flash"]: + r = await call_model(short, "You are a test. Reply only the requested JSON.", + msg, temperature=0.0, max_tokens=200, retries=2) + print(f" {short}: {r['status']} - {r['content'][:200]!r} err={r['error']}") + # Grader + r = await call_model(GRADER_MODEL, "You are a test.", msg, temperature=0.0, + max_tokens=200, retries=2) + print(f" {GRADER_MODEL} (grader): {r['status']} - {r['content'][:200]!r} err={r['error']}") + + +if __name__ == "__main__": + asyncio.run(_health_check()) diff --git a/analysis/rescue_pooled.py b/analysis/rescue_pooled.py new file mode 100644 index 0000000..cc9f782 --- /dev/null +++ b/analysis/rescue_pooled.py @@ -0,0 +1,174 @@ +"""Pooled rescue analysis for the rebuttal headline. + +Reports: +1. Per-variant pooled rebound rates with Wilson 95% CI for each condition +2. Pooled McNemar (paired) tests across all 4 models per variant +3. Pooled McNemar across all 5 surface variants for each model +4. 
Headline single-cell numbers +""" +from __future__ import annotations +import json +import math +import statistics +from collections import defaultdict +from pathlib import Path + +PATH = Path("/home/yurenh2/gap/analysis/rescue_results/rescue_30.jsonl") +OUT_PATH = Path("/home/yurenh2/gap/analysis/rescue_pooled_summary.json") + + +def wilson_ci(k: int, n: int, z: float = 1.96): + if n == 0: + return (0.0, 0.0, 0.0) + p = k / n + denom = 1 + z * z / n + center = (p + z * z / (2 * n)) / denom + half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom + return (p, max(0.0, center - half), min(1.0, center + half)) + + +def mcnemar_p(b: int, c: int) -> float: + n = b + c + if n == 0: + return 1.0 + k = min(b, c) + cum = sum(math.comb(n, i) * (0.5 ** n) for i in range(k + 1)) + return min(1.0, 2 * cum) + + +def main(): + rows = [json.loads(l) for l in open(PATH)] + print(f"Loaded {len(rows)} rows\n") + + # case_grades[(model, variant, index)] = {cond: grade} + case_grades = defaultdict(dict) + for r in rows: + case_grades[(r["model"], r["variant"], r["index"])][r["condition"]] = r.get("grade") + + variants_order = ["descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"] + short = {"descriptive_long":"DL","descriptive_long_confusing":"DLC", + "descriptive_long_misleading":"DLM","garbled_string":"GS","kernel_variant":"KV"} + + summary = {} + + print("=" * 92) + print("HEADLINE: Rescue rebound by variant (pooled across 4 models)") + print("=" * 92) + print(f"{'Variant':<6} {'Condition':<14} {'k/n':>10} {'rate':>7} " + f"{'95% Wilson CI':>20} {'Δ vs null':>11}") + print("-" * 80) + var_summary = {} + for v in variants_order: + # Pool counts across models + cell_counts = defaultdict(lambda: {"k": 0, "n": 0}) + for k, grds in case_grades.items(): + if k[1] != v: continue + for cond in ("null", "canonical_T2", "own_T2"): + if cond in grds: + cell_counts[cond]["n"] += 1 + if grds[cond] == "CORRECT": + 
cell_counts[cond]["k"] += 1 + # Wilson CIs + per_cond = {} + null_p = cell_counts["null"]["k"] / max(1, cell_counts["null"]["n"]) + for cond in ("null", "canonical_T2", "own_T2"): + if cond not in cell_counts: continue + c = cell_counts[cond] + if c["n"] == 0: continue + p, lo, hi = wilson_ci(c["k"], c["n"]) + delta = (p - null_p) * 100 if cond != "null" else 0.0 + per_cond[cond] = {"k": c["k"], "n": c["n"], "p": p, "ci": [lo, hi], "delta_pp": delta} + print(f"{short[v]:<6} {cond:<14} {c['k']:>4}/{c['n']:>4} " + f"{p*100:>5.1f}% [{lo*100:>5.1f}%, {hi*100:>5.1f}%] " + f"{'+' if delta > 0 else ('' if delta == 0 else '-')}{abs(delta):>5.1f} pp") + # Pooled McNemar (own vs null, can vs null, own vs can) + mc = {} + for a, b in [("canonical_T2", "null"), ("own_T2", "null"), + ("own_T2", "canonical_T2")]: + b_count = c_count = 0 + for k, grds in case_grades.items(): + if k[1] != v: continue + ga = grds.get(a); gb = grds.get(b) + if ga is None or gb is None: continue + if ga == "CORRECT" and gb == "INCORRECT": b_count += 1 + elif ga == "INCORRECT" and gb == "CORRECT": c_count += 1 + p = mcnemar_p(b_count, c_count) + mc[f"{a}_vs_{b}"] = {"b": b_count, "c": c_count, "p": p} + var_summary[v] = {"per_cond": per_cond, "mcnemar": mc} + print() + + summary["per_variant"] = var_summary + + # Pooled McNemar across all surface variants for canonical vs null and own vs null + print("\n" + "=" * 92) + print("POOLED McNEMAR (across all 4 surface variants × 4 models)") + print("=" * 92) + surface_vs = ["descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string"] + for a, b in [("canonical_T2", "null"), ("own_T2", "null"), + ("own_T2", "canonical_T2")]: + b_count = c_count = 0 + for k, grds in case_grades.items(): + if k[1] not in surface_vs: continue + ga = grds.get(a); gb = grds.get(b) + if ga is None or gb is None: continue + if ga == "CORRECT" and gb == "INCORRECT": b_count += 1 + elif ga == "INCORRECT" and gb == "CORRECT": c_count += 1 + p = 
mcnemar_p(b_count, c_count) + n = b_count + c_count + odds_ratio = b_count / max(1, c_count) + print(f" {a:<14} > {b:<14} b={b_count:>4}, c={c_count:>4} " + f"OR={odds_ratio:>4.2f} McNemar p={p:.2e} (n_discordant={n})") + # KV separately + print() + for a, b in [("canonical_T2", "null")]: + b_count = c_count = 0 + for k, grds in case_grades.items(): + if k[1] != "kernel_variant": continue + ga = grds.get(a); gb = grds.get(b) + if ga is None or gb is None: continue + if ga == "CORRECT" and gb == "INCORRECT": b_count += 1 + elif ga == "INCORRECT" and gb == "CORRECT": c_count += 1 + p = mcnemar_p(b_count, c_count) + odds_ratio = b_count / max(1, c_count) + print(f" KV: {a:<14} > {b:<14} b={b_count:>4}, c={c_count:>4} " + f"OR={odds_ratio:>4.2f} McNemar p={p:.2e}") + + # Per model summary + print("\n" + "=" * 92) + print("PER MODEL (averaged across 4 surface variants)") + print("=" * 92) + print(f"{'Model':<22} {'null':>10} {'canonical_T2':>14} {'own_T2':>10} " + f"{'can-null':>10} {'own-null':>10}") + per_model = {} + for model in sorted({k[0] for k in case_grades}): + cnts = defaultdict(lambda: {"k": 0, "n": 0}) + for k, grds in case_grades.items(): + if k[0] != model: continue + if k[1] not in surface_vs: continue + for cond in ("null", "canonical_T2", "own_T2"): + if cond in grds: + cnts[cond]["n"] += 1 + if grds[cond] == "CORRECT": + cnts[cond]["k"] += 1 + nul_p = cnts["null"]["k"] / max(1, cnts["null"]["n"]) + can_p = cnts["canonical_T2"]["k"] / max(1, cnts["canonical_T2"]["n"]) + own_p = cnts["own_T2"]["k"] / max(1, cnts["own_T2"]["n"]) + per_model[model] = { + "null": {"k": cnts["null"]["k"], "n": cnts["null"]["n"], "p": nul_p}, + "canonical_T2": {"k": cnts["canonical_T2"]["k"], "n": cnts["canonical_T2"]["n"], "p": can_p}, + "own_T2": {"k": cnts["own_T2"]["k"], "n": cnts["own_T2"]["n"], "p": own_p}, + "can_minus_null_pp": (can_p - nul_p) * 100, + "own_minus_null_pp": (own_p - nul_p) * 100, + } + print(f" {model:<20} {nul_p*100:>9.1f}% {can_p*100:>13.1f}% 
{own_p*100:>9.1f}% " + f"{(can_p-nul_p)*100:>+9.1f}pp {(own_p-nul_p)*100:>+9.1f}pp") + summary["per_model"] = per_model + + json.dump(summary, open(OUT_PATH, "w"), indent=2) + print(f"\nSaved -> {OUT_PATH}") + + +if __name__ == "__main__": + main() diff --git a/analysis/rescue_prompts.py b/analysis/rescue_prompts.py new file mode 100644 index 0000000..8e8f65c --- /dev/null +++ b/analysis/rescue_prompts.py @@ -0,0 +1,267 @@ +"""Rescue-experiment prompt construction. + +For each (model, variant, flip-case) we build prompts under three conditions: +- own_T2: model's own original-correct trajectory truncated at first + formal equation (with leakage filter), variables auto-renamed + to variant names via the dataset's rename map +- canonical_T2: the dataset's canonical variant solution truncated at first + formal equation (no rename needed; already in variant naming) +- null: generic content-free scaffold + +Truncation rule (event-boundary): + 1. Find the FIRST display-math block ($$...$$, \\[...\\], \\begin{equation/align/...}) + 2. If none, fall back to the first line containing a substantive math relation + (>=, <=, =, <, >, ≡, ∈) that is not merely a definition (e.g., 'let x:=...') + 3. The T2 prefix INCLUDES that first formal relation + 4. 
Apply leakage filter BEFORE returning: stop at the earliest of: + - any line containing \\boxed + - any line containing 'therefore', 'hence', 'we conclude', 'the answer', + 'we obtain', 'thus', 'it suffices', 'we have proved', 'as a result' + - any line containing the dataset's recorded final_answer string +""" +from __future__ import annotations +import re +from typing import Optional, Dict + + +# ---------- Display-math detection ---------- + +# Order matters: try richest patterns first +_DISPLAY_MATH_PATTERNS = [ + re.compile(r"\$\$.+?\$\$", re.DOTALL), + re.compile(r"\\\[.+?\\\]", re.DOTALL), + re.compile(r"\\begin\{equation\*?\}.+?\\end\{equation\*?\}", re.DOTALL), + re.compile(r"\\begin\{align\*?\}.+?\\end\{align\*?\}", re.DOTALL), + re.compile(r"\\begin\{gather\*?\}.+?\\end\{gather\*?\}", re.DOTALL), + re.compile(r"\\begin\{eqnarray\*?\}.+?\\end\{eqnarray\*?\}", re.DOTALL), +] + + +def _first_display_math_end(text: str) -> Optional[int]: + """Return the end position of the first display-math block, or None.""" + earliest = None + for pat in _DISPLAY_MATH_PATTERNS: + m = pat.search(text) + if m: + if earliest is None or m.end() < earliest: + earliest = m.end() + return earliest + + +# Inline relation fallback: first line with a "real" relation +_INLINE_REL_RE = re.compile( + r"[A-Za-z\)\]\}\d_]\s*(?:=|<|>|\\le[q]?|\\ge[q]?|\\equiv|\\in)\s*[A-Za-z\(\[\{\d\\\-]" +) +# Definition exclusion: lines that are 'let x = ...' or 'denote ...' are setup, +# not actual derivations. We allow them in the prefix but don't stop on them. +_DEFINITION_RE = re.compile( + r"^\s*(?:let|denote|define|set|put|call|consider|introduce|let us)\b", + re.IGNORECASE +) + + +def _first_inline_relation_line_end(text: str) -> Optional[int]: + """Find the end of the first line containing a non-definition math relation. 
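A minimal standalone check of this fallback, with the two regexes restated and three invented sample lines (not real trajectory text):

```python
import re

# Restatement of _INLINE_REL_RE / _DEFINITION_RE from this module.
REL = re.compile(
    r"[A-Za-z\)\]\}\d_]\s*(?:=|<|>|\\le[q]?|\\ge[q]?|\\equiv|\\in)\s*[A-Za-z\(\[\{\d\\\-]"
)
DEF = re.compile(
    r"^\s*(?:let|denote|define|set|put|call|consider|introduce|let us)\b",
    re.IGNORECASE,
)

lines = [
    "Let x = 2 be our parameter.",   # has a relation, but is a definition: skipped
    "We now study the problem.",     # no relation at all
    "Then x + 1 = 3 follows.",       # first substantive relation: stop here
]
first = next(l for l in lines if REL.search(l) and not DEF.search(l))
print(first)  # "Then x + 1 = 3 follows."
```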
+ + Returns absolute character offset (one past the newline).""" + pos = 0 + while pos < len(text): + nl = text.find("\n", pos) + line_end = nl if nl != -1 else len(text) + line = text[pos:line_end] + if _INLINE_REL_RE.search(line) and not _DEFINITION_RE.search(line): + return line_end + 1 if nl != -1 else line_end + pos = line_end + 1 + if nl == -1: + break + return None + + +# ---------- Leakage detection ---------- + +LEAKAGE_PATTERNS = [ + re.compile(r"\\boxed\b", re.IGNORECASE), + re.compile(r"\btherefore\b", re.IGNORECASE), + re.compile(r"\bhence\b", re.IGNORECASE), + re.compile(r"\bwe conclude\b", re.IGNORECASE), + re.compile(r"\bthe answer\b", re.IGNORECASE), + re.compile(r"\bwe obtain\b", re.IGNORECASE), + re.compile(r"\bthus\b", re.IGNORECASE), + re.compile(r"\bit suffices\b", re.IGNORECASE), + re.compile(r"\bwe have proved\b", re.IGNORECASE), + re.compile(r"\bwe have shown\b", re.IGNORECASE), + re.compile(r"\bas a result\b", re.IGNORECASE), + re.compile(r"\bin conclusion\b", re.IGNORECASE), + re.compile(r"\bthe final answer\b", re.IGNORECASE), + re.compile(r"\bso the answer\b", re.IGNORECASE), +] + + +def _first_leakage_pos(text: str, final_answer: Optional[str] = None) -> Optional[int]: + """Return the starting char position of the earliest leakage marker.""" + earliest = None + for pat in LEAKAGE_PATTERNS: + m = pat.search(text) + if m: + if earliest is None or m.start() < earliest: + earliest = m.start() + if final_answer: + # Final-answer leakage: only check if the answer string is non-trivial + fa = final_answer.strip() + if 8 <= len(fa) <= 200: + idx = text.find(fa) + if idx != -1: + if earliest is None or idx < earliest: + earliest = idx + return earliest + + +# ---------- T2 truncation ---------- + +MIN_PREFIX_CHARS = 50 +MAX_PREFIX_CHARS = 2400 # roughly 600 tokens + + +def truncate_T2(text: str, final_answer: Optional[str] = None) -> Optional[str]: + """Return the T2 (after-first-equation) prefix, or None if not detectable. 
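On a toy trajectory the whole rule looks like this. The three-line "solution" is hand-written for illustration, and only the display-math and 'therefore' patterns are exercised, not the full pattern lists:

```python
import re

# Hypothetical solver trajectory: prose, one display-math block, then a
# conclusion line that the leakage filter must exclude.
text = ("We compute the telescoping sum.\n"
        "$$S_n = \\sum_{k=1}^{n} \\frac{1}{k(k+1)} = 1 - \\frac{1}{n+1}$$\n"
        "Therefore the answer is n/(n+1).")

m = re.search(r"\$\$.+?\$\$", text, re.DOTALL)     # step 1: first display-math block
prefix = text[:m.end()]                            # step 3: prefix includes it
leak = re.search(r"\btherefore\b", prefix, re.IGNORECASE)
if leak:                                           # step 4: leakage filter
    prefix = prefix[:leak.start()].rstrip()

print(prefix.endswith("$$"))   # True: the first equation is kept
print("Therefore" in prefix)   # False: the conclusion never enters the prefix
```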
+ + T2 = up to and including the first formal equation, then capped by leakage + filter and MAX_PREFIX_CHARS. + """ + if not text: + return None + end = _first_display_math_end(text) + if end is None: + end = _first_inline_relation_line_end(text) + if end is None: + return None + prefix = text[:end] + # Apply leakage filter BEFORE the equation if a leakage marker appears earlier + leak = _first_leakage_pos(prefix, final_answer) + if leak is not None and leak < end: + prefix = text[:leak].rstrip() + # Cap length + if len(prefix) > MAX_PREFIX_CHARS: + prefix = prefix[:MAX_PREFIX_CHARS] + # Trim at last newline to avoid cutting mid-sentence + last_nl = prefix.rfind("\n") + if last_nl > MIN_PREFIX_CHARS: + prefix = prefix[:last_nl] + if len(prefix) < MIN_PREFIX_CHARS: + return None + return prefix.rstrip() + + +# ---------- Variable rename for own prefix ---------- + +def rename_own_prefix(prefix: str, rename_map: Dict[str, str]) -> str: + """Apply orig->variant rename mapping to the model's own prefix. + + Sort longest-first to avoid prefix collisions (e.g., 'al' eating 'almondtree'). + Use word-boundary regex. Pass replacement via lambda to avoid escape-sequence + interpretation when the variant name starts with '\\x', '\\g', etc. + """ + if not prefix or not rename_map: + return prefix + items = sorted(rename_map.items(), key=lambda kv: -len(kv[0])) + out = prefix + for src, dst in items: + if not src: + continue + pat = r"(?<![A-Za-z0-9_])" + re.escape(src) + r"(?![A-Za-z0-9_])" + # Use a lambda so dst is treated literally (no \1, \x, etc. escapes). + out = re.sub(pat, lambda _m, _dst=dst: _dst, out) + return out + + +# ---------- Null scaffold ---------- + +NULL_SCAFFOLD = ( + "Let us proceed carefully. We will first identify the relevant variables " + "and their roles, then state the governing relations of the problem, and " + "finally develop the argument step by step." 
+) + + +# ---------- Prompt builders ---------- + +# We tell the model to PRODUCE the complete solution that begins with the +# provided prefix verbatim. This means the grader will see one continuous +# solution that starts with the injected setup. The instruction to begin +# verbatim avoids the model paraphrasing the prefix and removing the very +# representational anchor we are testing. + +RESCUE_USER_TEMPLATE = """Please solve the following mathematical problem. + +PROBLEM: +{problem_statement} + +You must structure your solution as a continuation of the partial work below. +Begin your solution with the partial work copied verbatim, then continue +seamlessly to a complete answer. + +PARTIAL WORK (to copy verbatim at the start of your solution): +{prefix} + +Provide a complete, rigorous solution. Return your response in JSON format: +{{"solution": "your complete solution starting with the partial work above and continuing to the end", + "final_answer": "your final answer in clear, concise form"}}""" + + +NULL_USER_TEMPLATE = """Please solve the following mathematical problem. + +PROBLEM: +{problem_statement} + +{scaffold} + +Provide a complete, rigorous solution. 
Return your response in JSON format: +{{"solution": "your complete step-by-step solution", + "final_answer": "your final answer in clear, concise form"}}""" + + +def build_rescue_prompt(problem_statement: str, prefix: str) -> str: + return RESCUE_USER_TEMPLATE.format( + problem_statement=problem_statement, prefix=prefix) + + +def build_null_prompt(problem_statement: str) -> str: + return NULL_USER_TEMPLATE.format( + problem_statement=problem_statement, scaffold=NULL_SCAFFOLD) + + +# ---------- Smoke test ---------- + +if __name__ == "__main__": + # Quick smoke test on a real flip case + import json + import sys + sys.path.insert(0, "/home/yurenh2/gap/analysis") + from structural_overlap import find_variant_file, load_problems + + # Pick gpt-4.1-mini original on a known problem + op = find_variant_file( + __import__("pathlib").Path("/home/yurenh2/gap/results_new/gpt-4.1-mini"), + "original") + probs = {p["index"]: p for p in load_problems(op)} + sample = next(p for idx, p in probs.items() + if p.get("correct") is True and (p.get("solve") or {}).get("solution")) + text = sample["solve"]["solution"] + fa = sample["solve"].get("final_answer") + print(f"Sample index: {sample['index']}, type: {sample['problem_type']}") + print(f"Original solution length: {len(text)} chars") + print(f"Recorded final_answer: {fa[:200] if fa else None!r}") + pre = truncate_T2(text, fa) + print(f"\n--- T2 PREFIX ({len(pre or '')} chars) ---") + print(pre) + print("--- END ---") + + # Test rename: load 1987-B-2 dataset to get a sample map + ds = json.load(open("/home/yurenh2/gap/putnam-bench-anon/dataset/1987-B-2.json")) + rmap_raw = ds["variants"]["garbled_string"]["map"] + rmap = (eval(rmap_raw, {"__builtins__": {}}, {}) + if isinstance(rmap_raw, str) else rmap_raw) + print(f"\nRename map: {rmap}") + test_text = "Let n be a positive integer and let f be a continuous function. Then $f(n) = 0$." 
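The word-boundary behavior is worth pinning down with a deterministic map as well. The two-entry map below is invented for illustration, not read from any dataset file; the loop mirrors rename_own_prefix:

```python
import re

# Longest-first, word-boundary rename, as in rename_own_prefix above.
rename_map = {"n": "m", "f": "g"}
s = "Let n be fixed and define fn(x) = f(x) + n."
for src, dst in sorted(rename_map.items(), key=lambda kv: -len(kv[0])):
    pat = r"(?<![A-Za-z0-9_])" + re.escape(src) + r"(?![A-Za-z0-9_])"
    s = re.sub(pat, lambda _m, _dst=dst: _dst, s)  # dst treated literally
print(s)  # "Let m be fixed and define fn(x) = g(x) + m."
```

Note that `fn`, `fixed`, and `define` are untouched: a single-letter source only matches as a whole token, which is exactly what the lookbehind/lookahead guards buy.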
+ print(f"\nOriginal: {test_text}") + print(f"Renamed: {rename_own_prefix(test_text, rmap)}") diff --git a/analysis/rescue_runner.py b/analysis/rescue_runner.py new file mode 100644 index 0000000..9c9f226 --- /dev/null +++ b/analysis/rescue_runner.py @@ -0,0 +1,341 @@ +"""End-to-end rescue experiment runner. + +For each (model, variant, flip-case): + - Build 3 prompts: own_T2, canonical_T2, null (KV: only canonical_T2 + null) + - Solve with the same model the case originally failed under + - Grade with gpt-4o using the variant problem + canonical variant solution as reference + - Save per-case results immediately to a jsonl checkpoint (resumable) + +Usage: + python rescue_runner.py --pilot # 5 cases per cell (smoke test) + python rescue_runner.py # 30 cases per cell (full run) +""" +from __future__ import annotations +import argparse +import asyncio +import json +import os +import random +import sys +import time +from pathlib import Path +from typing import Optional + +# Local imports +THIS_DIR = Path(__file__).resolve().parent +sys.path.insert(0, str(THIS_DIR)) +from rescue_prompts import ( + truncate_T2, rename_own_prefix, + build_rescue_prompt, build_null_prompt, NULL_SCAFFOLD, +) +from rescue_api import ( + SOLVER_PROVIDERS, solve, grade, parse_solution, parse_grade, +) +from structural_overlap import ( + DATASET_DIR, RESULTS_DIR, find_variant_file, load_problems, SURFACE_VARIANTS, +) + + +# Short model name -> directory name in results_new +MODEL_RESULTS_DIRS = { + "gpt-4.1-mini": "gpt-4.1-mini", + "gpt-4o-mini": "gpt-4o-mini", + "claude-sonnet-4": "claude-sonnet-4", + "gemini-2.5-flash": "gemini_2.5_flash", # historical underscore naming +} +SELECTED_MODELS = ["gpt-4.1-mini", "gpt-4o-mini", "claude-sonnet-4", "gemini-2.5-flash"] +ALL_VARIANTS = SURFACE_VARIANTS + ["kernel_variant"] +SURFACE_CONDITIONS = ["own_T2", "canonical_T2", "null"] +KV_CONDITIONS = ["canonical_T2", "null"] + + +# ---------- Dataset loading ---------- + +def load_dataset_full() -> dict: 
+ """Returns: {idx: {original: {...}, variants: {v: {map, question, solution}}}}. + + The dataset stores top-level question/solution and variant-keyed question/solution/map. + """ + out = {} + for f in sorted(DATASET_DIR.glob("*.json")): + d = json.load(open(f)) + idx = d.get("index") + cell = { + "problem_type": d.get("problem_type"), + "original_question": d.get("question") or "", + "original_solution": d.get("solution") or "", + "variants": {}, + } + for v, vd in d.get("variants", {}).items(): + if isinstance(vd, dict): + rmap = vd.get("map") + if isinstance(rmap, str): + try: + rmap = eval(rmap, {"__builtins__": {}}, {}) + except Exception: + rmap = None + cell["variants"][v] = { + "question": vd.get("question") or "", + "solution": vd.get("solution") or "", + "map": rmap if isinstance(rmap, dict) else None, + } + out[idx] = cell + return out + + +# ---------- Flip case selection ---------- + +def find_flip_cases(model: str, variant: str, max_cases: int, + seed: int = 42) -> list: + """Identify (orig_correct, var_wrong) flip cases for the cell. + + Returns list of dicts with: index, problem_type, model_orig_solution, + final_answer (recorded), variant_problem_statement (from results). 
+ """ + mdir = RESULTS_DIR / MODEL_RESULTS_DIRS.get(model, model) + op = find_variant_file(mdir, "original") + vp = find_variant_file(mdir, variant) + if not op or not vp: + return [] + orig_by = {p["index"]: p for p in load_problems(op)} + var_by = {p["index"]: p for p in load_problems(vp)} + cases = [] + for idx in sorted(set(orig_by) & set(var_by)): + po, pv = orig_by[idx], var_by[idx] + if po.get("correct") is not True or pv.get("correct") is not False: + continue + orig_text = (po.get("solve") or {}).get("solution") or "" + if not orig_text: + continue + # Skip cases where we couldn't extract a T2 prefix from the original + fa = (po.get("solve") or {}).get("final_answer") or "" + if truncate_T2(orig_text, fa) is None: + continue + cases.append({ + "index": idx, + "problem_type": po.get("problem_type"), + "orig_solution": orig_text, + "orig_final_answer": fa, + }) + rng = random.Random(seed) + rng.shuffle(cases) + return cases[:max_cases] + + +# ---------- Prompt construction per case ---------- + +def build_case_prompts(case: dict, variant: str, ds_cell: dict) -> dict: + """Returns: {condition_name: user_message_string}.""" + var_info = ds_cell["variants"].get(variant, {}) + var_question = var_info.get("question", "") + if not var_question: + return {} + prompts = {} + is_kv = (variant == "kernel_variant") + + # canonical_T2: dataset's canonical variant solution truncated + canon_sol = var_info.get("solution", "") + if canon_sol: + canon_pre = truncate_T2(canon_sol, None) + if canon_pre: + prompts["canonical_T2"] = build_rescue_prompt(var_question, canon_pre) + + # own_T2: only for surface variants — model's own original-correct prefix renamed + if not is_kv: + rmap = var_info.get("map") or {} + own_pre = truncate_T2(case["orig_solution"], case.get("orig_final_answer")) + if own_pre and rmap: + renamed = rename_own_prefix(own_pre, rmap) + prompts["own_T2"] = build_rescue_prompt(var_question, renamed) + + # null: always available + prompts["null"] = 
build_null_prompt(var_question) + return prompts + + +# ---------- Per-condition runner ---------- + +async def run_one_condition(model: str, condition: str, user_msg: str, + case: dict, variant: str, ds_cell: dict) -> dict: + """Solve + grade a single condition for a single case. Returns a result dict.""" + var_info = ds_cell["variants"].get(variant, {}) + var_question = var_info.get("question", "") + canon_sol = var_info.get("solution", "") + problem_type = case["problem_type"] + t0 = time.time() + solve_resp = await solve(model, user_msg) + solve_dt = time.time() - t0 + if solve_resp["status"] != "success": + return { + "model": model, "variant": variant, "condition": condition, + "index": case["index"], "problem_type": problem_type, + "solve_status": "failed", + "solve_error": solve_resp["error"], + "solve_seconds": solve_dt, + "grade": None, + } + parsed = parse_solution(solve_resp["content"]) + if not parsed["solution"]: + return { + "model": model, "variant": variant, "condition": condition, + "index": case["index"], "problem_type": problem_type, + "solve_status": "parse_failed", + "solve_error": parsed.get("_parse_error"), + "solve_seconds": solve_dt, + "raw_solve_content": solve_resp["content"][:500], + "grade": None, + } + student_solution = parsed["solution"] + t1 = time.time() + grade_resp = await grade(problem_type, var_question, student_solution, canon_sol) + grade_dt = time.time() - t1 + if grade_resp["status"] != "success": + return { + "model": model, "variant": variant, "condition": condition, + "index": case["index"], "problem_type": problem_type, + "solve_status": "success", + "solve_seconds": solve_dt, + "grade_seconds": grade_dt, + "grade_status": "failed", + "grade_error": grade_resp["error"], + "student_solution_len": len(student_solution), + "student_final_answer": parsed["final_answer"], + "grade": None, + } + parsed_grade = parse_grade(grade_resp["content"]) + return { + "model": model, "variant": variant, "condition": condition, + 
"index": case["index"], "problem_type": problem_type, + "solve_status": "success", + "solve_seconds": solve_dt, + "grade_seconds": grade_dt, + "grade_status": "success", + "student_solution_len": len(student_solution), + "student_solution": student_solution, # full text for downstream analysis + "student_final_answer": parsed["final_answer"][:500], + "grade": parsed_grade["grade"], + "final_answer_correct": parsed_grade.get("final_answer_correct"), + "grade_feedback": (parsed_grade.get("detailed_feedback") or "")[:1000], + } + + +# ---------- Main run ---------- + +OUT_DIR = Path("/home/yurenh2/gap/analysis/rescue_results") +OUT_DIR.mkdir(parents=True, exist_ok=True) + + +def load_existing_keys(path: Path) -> set: + """Read jsonl checkpoint and return set of (cell_key, condition, index).""" + keys = set() + if not path.exists(): + return keys + with open(path) as f: + for line in f: + try: + d = json.loads(line) + keys.add((d["model"], d["variant"], d["condition"], d["index"])) + except Exception: + pass + return keys + + +async def run_all(num_cases_per_cell: int, dry_run: bool = False, models=None, + variants=None): + print(f"Loading dataset ...", flush=True) + ds = load_dataset_full() + print(f" loaded {len(ds)} problems", flush=True) + + out_path = OUT_DIR / f"rescue_{num_cases_per_cell}.jsonl" + existing = load_existing_keys(out_path) + print(f"Output: {out_path} (existing rows: {len(existing)})") + + models = models or SELECTED_MODELS + variants = variants or ALL_VARIANTS + + # Build the full task list + tasks_to_run = [] + cell_summary = {} + for model in models: + for variant in variants: + cases = find_flip_cases(model, variant, num_cases_per_cell) + cell_key = f"{model}/{variant}" + cell_summary[cell_key] = {"flip_cases_found": len(cases), + "added_tasks": 0} + for case in cases: + ds_cell = ds.get(case["index"]) + if ds_cell is None: + continue + prompts = build_case_prompts(case, variant, ds_cell) + for cond, user_msg in prompts.items(): + key = (model, 
variant, cond, case["index"]) + if key in existing: + continue + tasks_to_run.append((model, variant, cond, case, ds_cell, user_msg)) + cell_summary[cell_key]["added_tasks"] += 1 + + print(f"\nCell-level plan ({num_cases_per_cell} flip cases each):") + for k, v in sorted(cell_summary.items()): + print(f" {k:<46} found={v['flip_cases_found']:>3} new_tasks={v['added_tasks']:>4}") + total = len(tasks_to_run) + print(f"\nTotal new tasks: {total}") + if dry_run: + return + + if not tasks_to_run: + print("Nothing to do.") + return + + # Execute concurrently. Use a writer task to drain results into the jsonl. + fout = open(out_path, "a") + write_lock = asyncio.Lock() + completed = 0 + failed = 0 + started_at = time.time() + + async def run_and_write(model, variant, cond, case, ds_cell, user_msg): + nonlocal completed, failed + try: + res = await run_one_condition(model, cond, user_msg, case, variant, ds_cell) + except Exception as e: + res = { + "model": model, "variant": variant, "condition": cond, + "index": case["index"], "problem_type": case.get("problem_type"), + "solve_status": "exception", + "solve_error": f"{type(e).__name__}: {str(e)[:300]}", + "grade": None, + } + failed += 1 + async with write_lock: + fout.write(json.dumps(res) + "\n") + fout.flush() + completed += 1 + if completed % 25 == 0 or completed == total: + elapsed = time.time() - started_at + rate = completed / elapsed if elapsed > 0 else 0 + eta = (total - completed) / rate if rate > 0 else 0 + print(f" [{completed:>4}/{total}] elapsed={elapsed:>5.0f}s " + f"rate={rate:>4.1f}/s eta={eta:>5.0f}s " + f"failed_so_far={failed}", flush=True) + + awaitables = [run_and_write(*t) for t in tasks_to_run] + await asyncio.gather(*awaitables) + fout.close() + print(f"\nDone. {completed}/{total} written. 
Failed: {failed}.") + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--pilot", action="store_true", help="run only 5 cases per cell") + ap.add_argument("--cases", type=int, default=30, help="cases per cell (full run)") + ap.add_argument("--dry-run", action="store_true", help="print plan, don't call APIs") + ap.add_argument("--models", nargs="+", default=None) + ap.add_argument("--variants", nargs="+", default=None) + args = ap.parse_args() + n = 5 if args.pilot else args.cases + asyncio.run(run_all(n, dry_run=args.dry_run, + models=args.models, variants=args.variants)) + + +if __name__ == "__main__": + main() diff --git a/analysis/sc_success_and_difficulty.py b/analysis/sc_success_and_difficulty.py new file mode 100644 index 0000000..a8b44db --- /dev/null +++ b/analysis/sc_success_and_difficulty.py @@ -0,0 +1,192 @@ +"""Two follow-up analyses (zero API): +1. Per-model self-correction success rate: P(correct | SC) vs P(correct | no SC) +2. Difficulty-stratified surface vs kernel dichotomy +""" +from __future__ import annotations +import json +import sys +import statistics +from pathlib import Path +from collections import defaultdict + +THIS_DIR = Path(__file__).resolve().parent +sys.path.insert(0, str(THIS_DIR)) +from structural_overlap import find_variant_file, load_problems, RESULTS_DIR, SURFACE_VARIANTS +from self_correction import has_self_correction + + +# ----------------- 1. 
SC success rate per model -----------------
+
+def sc_success_rate():
+    base = RESULTS_DIR
+    models = sorted([d.name for d in base.iterdir() if d.is_dir()])
+
+    print("=" * 80)
+    print("PER-MODEL SELF-CORRECTION SUCCESS RATE")
+    print("(does an SC attempt improve probability of being correct?)")
+    print("=" * 80)
+    print()
+
+    rows = []
+    for m in models:
+        mdir = base / m
+        # Aggregate over all variants
+        n_sc_correct = 0
+        n_sc_total = 0
+        n_nosc_correct = 0
+        n_nosc_total = 0
+        for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]:
+            vp = find_variant_file(mdir, v)
+            if not vp: continue
+            for p in load_problems(vp):
+                text = (p.get("solve") or {}).get("solution") or ""
+                if not text: continue
+                correct = p.get("correct")
+                if correct is None: continue
+                if has_self_correction(text):
+                    n_sc_total += 1
+                    if correct: n_sc_correct += 1
+                else:
+                    n_nosc_total += 1
+                    if correct: n_nosc_correct += 1
+        if n_sc_total < 5 or n_nosc_total < 5:
+            continue
+        p_sc = n_sc_correct / n_sc_total
+        p_nosc = n_nosc_correct / n_nosc_total
+        delta = p_sc - p_nosc
+        rows.append({
+            "model": m,
+            "sc_n": n_sc_total, "sc_correct": n_sc_correct, "p_sc": p_sc,
+            "nosc_n": n_nosc_total, "nosc_correct": n_nosc_correct, "p_nosc": p_nosc,
+            "delta": delta,
+        })
+
+    rows.sort(key=lambda r: -r["sc_n"])
+    print(f"{'Model':<22} {'#SC trials':>11} {'P(corr|SC)':>12} {'P(corr|noSC)':>13} {'Δ':>9}")
+    print("-" * 75)
+    for r in rows:
+        print(f"{r['model']:<22} {r['sc_n']:>11} "
+              f"{r['p_sc']*100:>10.1f}% {r['p_nosc']*100:>11.1f}% "
+              f"{r['delta']*100:>+7.1f}pp")
+
+    json.dump(rows, open(THIS_DIR / "sc_success_per_model.json", "w"), indent=2)
+    return rows
+
+
+# ----------------- 2. Difficulty stratified dichotomy -----------------
+
+DATASET_DIR = Path("/home/yurenh2/gap/putnam-bench-anon/dataset")
+
+def load_difficulty_metadata():
+    """Per-problem difficulty assignment using a year/section/index heuristic.
+ + Per the paper's existing exposition, we derive Easy/Medium/Hard from the + problem index (1-2 = Easy, 3-4 = Medium, 5-6 = Hard, 7-8 = extra-hard tail) + because the dataset's `difficulty` field is heterogeneous. + """ + out = {} + for f in sorted(DATASET_DIR.glob("*.json")): + d = json.load(open(f)) + idx = d.get("index") + if not idx: continue + # Extract problem number from "YEAR-PART-NUM" + parts = idx.split("-") + if len(parts) != 3: continue + try: + num = int(parts[2]) + except ValueError: + continue + if num <= 2: bucket = "Easy" + elif num <= 4: bucket = "Medium" + elif num <= 6: bucket = "Hard" + else: bucket = "ExtraHard" + out[idx] = bucket + return out + + +def difficulty_stratified_dichotomy(): + print("\n\n" + "=" * 80) + print("DIFFICULTY-STRATIFIED ACCURACY (mean across 18 models)") + print("Easy/Medium/Hard buckets defined by problem index 1-2/3-4/5-6") + print("=" * 80) + print() + + diff = load_difficulty_metadata() + base = RESULTS_DIR + models = sorted([d.name for d in base.iterdir() if d.is_dir()]) + + # buckets[(model, variant, difficulty)] = (n, n_correct) + cells = defaultdict(lambda: [0, 0]) + for m in models: + mdir = base / m + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + vp = find_variant_file(mdir, v) + if not vp: continue + for p in load_problems(vp): + idx = p.get("index") + correct = p.get("correct") + if idx is None or correct is None: continue + bucket = diff.get(idx, "Unknown") + cells[(m, v, bucket)][0] += 1 + if correct: cells[(m, v, bucket)][1] += 1 + + # Aggregate per (variant, difficulty) by averaging per-model rates + print(f"{'Variant':<24} {'Easy':>8} {'Medium':>8} {'Hard':>8} {'XHard':>8}") + print("-" * 60) + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + row = {} + for bucket in ["Easy", "Medium", "Hard", "ExtraHard"]: + rates = [] + for m in models: + n, c = cells.get((m, v, bucket), [0, 0]) + if n >= 5: + rates.append(c / n) + row[bucket] = statistics.fmean(rates) * 100 if 
rates else None
+        line = f"{v:<24}"
+        for bucket in ["Easy", "Medium", "Hard", "ExtraHard"]:
+            val = row[bucket]
+            line += f" {val:>7.1f}%" if val is not None else f" {'-':>8}"
+        print(line)
+
+    # Compute Δ_orig→KV per difficulty bucket
+    print(f"\n--- Δ original → KV per difficulty bucket ---")
+    for bucket in ["Easy", "Medium", "Hard", "ExtraHard"]:
+        orig_rates = []
+        kv_rates = []
+        for m in models:
+            no, co = cells.get((m, "original", bucket), [0, 0])
+            nk, ck = cells.get((m, "kernel_variant", bucket), [0, 0])
+            if no >= 5 and nk >= 5:
+                orig_rates.append(co / no)
+                kv_rates.append(ck / nk)
+        if orig_rates:
+            mo = statistics.fmean(orig_rates) * 100
+            mk = statistics.fmean(kv_rates) * 100
+            print(f"  {bucket:<10} orig={mo:5.1f}%  kv={mk:5.1f}%  Δ={mk-mo:+.1f}pp")
+
+    # Compute Δ_orig→GS per difficulty bucket
+    print(f"\n--- Δ original → GS (surface, hardest renamer) per difficulty bucket ---")
+    for bucket in ["Easy", "Medium", "Hard", "ExtraHard"]:
+        orig_rates = []
+        gs_rates = []
+        for m in models:
+            no, co = cells.get((m, "original", bucket), [0, 0])
+            ng, cg = cells.get((m, "garbled_string", bucket), [0, 0])
+            if no >= 5 and ng >= 5:
+                orig_rates.append(co / no)
+                gs_rates.append(cg / ng)
+        if orig_rates:
+            mo = statistics.fmean(orig_rates) * 100
+            mg = statistics.fmean(gs_rates) * 100
+            print(f"  {bucket:<10} orig={mo:5.1f}%  GS={mg:5.1f}%  Δ={mg-mo:+.1f}pp")
+
+
+def main():
+    sc_success_rate()
+    difficulty_stratified_dichotomy()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/analysis/self_correction.py b/analysis/self_correction.py
new file mode 100644
index 0000000..5769647
--- /dev/null
+++ b/analysis/self_correction.py
@@ -0,0 +1,202 @@
+"""Self-correction / metacognition probe.
+
+Scan model trajectories for self-correction markers and compute:
+1. Attempt rate (trajectory contains a self-correction marker) per (model, variant, group)
+2. 
Whether self-correction attempt rate differs between stable / brittle-drift / rescued cases
+3. Conditional success: among trajectories with a self-correction attempt, what fraction is graded CORRECT?
+
+Self-correction markers (case-insensitive, word-boundary):
+- "wait" (e.g., "Wait, let me reconsider")
+- "actually" (e.g., "Actually, I think...")
+- "let me reconsider"
+- "let me redo"
+- "let me try again"
+- "I made a mistake"
+- "this is wrong"
+- "on second thought"
+- "correction:"
+- "scratch that"
+- "I was wrong"
+- "let me start over"
+- "hmm, actually/wait/that"
+- "I need to redo/reconsider"
+- "oh wait"
+
+Uses two data sources:
+A. The original 18-model results in /home/yurenh2/gap/results_new/ (stable + brittle drift + collapse)
+B. The rescue trajectories in analysis/rescue_results/rescue_30.jsonl (3 conditions × 4 models × 5 variants)
+"""
+from __future__ import annotations
+import json
+import re
+import sys
+import statistics
+from pathlib import Path
+from collections import defaultdict
+
+THIS_DIR = Path(__file__).resolve().parent
+sys.path.insert(0, str(THIS_DIR))
+from structural_overlap import find_variant_file, load_problems, RESULTS_DIR, SURFACE_VARIANTS
+
+SC_PATTERNS = [
+    re.compile(r"\bwait\b[,.]?\s+(let|actually|that|i)", re.IGNORECASE),
+    re.compile(r"\bactually[,.]\s", re.IGNORECASE),
+    re.compile(r"\blet\s+me\s+reconsider", re.IGNORECASE),
+    re.compile(r"\blet\s+me\s+redo", re.IGNORECASE),
+    re.compile(r"\blet\s+me\s+try\s+(this\s+)?again", re.IGNORECASE),
+    re.compile(r"\bi\s+made\s+a\s+mistake", re.IGNORECASE),
+    re.compile(r"\bthis\s+is\s+(wrong|incorrect)", re.IGNORECASE),
+    re.compile(r"\bon\s+second\s+thought", re.IGNORECASE),
+    re.compile(r"\bcorrection[:\s]", re.IGNORECASE),
+    re.compile(r"\bscratch\s+that", re.IGNORECASE),
+    re.compile(r"\bi\s+was\s+wrong", re.IGNORECASE),
+    re.compile(r"\blet\s+me\s+start\s+over", re.IGNORECASE),
+    re.compile(r"\bhmm[,.]\s+(actually|wait|that)", re.IGNORECASE),
+    re.compile(r"\bi\s+need\s+to\s+(redo|reconsider)", re.IGNORECASE),
+    
re.compile(r"\boh\s+wait", re.IGNORECASE), +] + + +def has_self_correction(text: str) -> bool: + if not text: + return False + for pat in SC_PATTERNS: + if pat.search(text): + return True + return False + + +def count_sc_markers(text: str) -> int: + if not text: + return 0 + return sum(len(pat.findall(text)) for pat in SC_PATTERNS) + + +# ---------- Source A: 18-model original results ---------- + +def analyze_18_models(): + """Self-correction rates in original solver runs across all 18 models.""" + base = RESULTS_DIR + models = sorted([d.name for d in base.iterdir() if d.is_dir()]) + print(f"\n=== SELF-CORRECTION IN 18-MODEL ORIGINAL RUNS ===\n") + print(f"Markers used: {len(SC_PATTERNS)} regex patterns") + print(f"Definition: trajectory contains at least one match.\n") + + rows = [] + for m in models: + mdir = base / m + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + vp = find_variant_file(mdir, v) + if not vp: + continue + problems = load_problems(vp) + n_total = 0 + n_sc = 0 + n_correct_sc = 0 + n_correct_total = 0 + n_wrong_sc = 0 + n_wrong_total = 0 + for p in problems: + text = (p.get("solve") or {}).get("solution") or "" + if not text: + continue + correct = p.get("correct") + if correct is None: + continue + n_total += 1 + sc = has_self_correction(text) + if sc: n_sc += 1 + if correct is True: + n_correct_total += 1 + if sc: n_correct_sc += 1 + else: + n_wrong_total += 1 + if sc: n_wrong_sc += 1 + if n_total > 0: + rows.append({ + "model": m, "variant": v, "n": n_total, + "sc_rate": n_sc / n_total, + "n_correct": n_correct_total, + "n_correct_sc_rate": n_correct_sc / max(1, n_correct_total), + "n_wrong": n_wrong_total, + "n_wrong_sc_rate": n_wrong_sc / max(1, n_wrong_total), + }) + + # Print compact table: per (variant) average across models + print(f"{'Variant':<24} {'mean SC%':>10} {'SC%|correct':>14} {'SC%|wrong':>12} {'asym (wrong-correct)':>22}") + print("-" * 90) + by_var = defaultdict(list) + for r in rows: + 
by_var[r["variant"]].append(r) + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + rs = by_var.get(v, []) + if not rs: + continue + m_sc = statistics.fmean(r["sc_rate"] for r in rs) * 100 + m_sc_c = statistics.fmean(r["n_correct_sc_rate"] for r in rs) * 100 + m_sc_w = statistics.fmean(r["n_wrong_sc_rate"] for r in rs) * 100 + asym = m_sc_w - m_sc_c + print(f"{v:<24} {m_sc:>9.1f}% {m_sc_c:>13.1f}% {m_sc_w:>11.1f}% {asym:>+21.1f}pp") + + # Per-model leader board + print(f"\n{'Model':<22} {'mean SC% (all variants)':>26}") + print("-" * 50) + by_model = defaultdict(list) + for r in rows: + by_model[r["model"]].append(r["sc_rate"]) + model_avgs = sorted([(m, statistics.fmean(vs) * 100) for m, vs in by_model.items()], + key=lambda t: -t[1]) + for m, avg in model_avgs: + print(f"{m:<22} {avg:>25.1f}%") + + return rows + + +# ---------- Source B: rescue trajectories ---------- + +def analyze_rescue(): + path = THIS_DIR / "rescue_results/rescue_30.jsonl" + rows = [json.loads(l) for l in open(path)] + print(f"\n\n=== SELF-CORRECTION IN 1{{,}}529 RESCUE TRAJECTORIES ===\n") + + # Group by (model, variant, condition, grade) + counts = defaultdict(lambda: {"n": 0, "sc": 0}) + for r in rows: + text = r.get("student_solution") or "" + if not text: + continue + key = (r["model"], r["variant"], r["condition"], r.get("grade")) + counts[key]["n"] += 1 + if has_self_correction(text): + counts[key]["sc"] += 1 + + # Aggregate per (variant, condition, grade) + by_vcg = defaultdict(lambda: {"n": 0, "sc": 0}) + for k, d in counts.items(): + m, v, c, g = k + by_vcg[(v, c, g)]["n"] += d["n"] + by_vcg[(v, c, g)]["sc"] += d["sc"] + + print(f"{'Variant':<24} {'Condition':<14} {'CORRECT-SC%':>14} {'INCORRECT-SC%':>16}") + print("-" * 80) + for v in ["descriptive_long","descriptive_long_confusing","descriptive_long_misleading","garbled_string","kernel_variant"]: + for c in ["null", "canonical_T2", "own_T2"]: + cor = by_vcg.get((v, c, "CORRECT"), {"n": 0, "sc": 0}) + inc = 
by_vcg.get((v, c, "INCORRECT"), {"n": 0, "sc": 0}) + if cor["n"] == 0 and inc["n"] == 0: + continue + sc_c = cor["sc"] / max(1, cor["n"]) * 100 if cor["n"] else 0 + sc_i = inc["sc"] / max(1, inc["n"]) * 100 if inc["n"] else 0 + print(f"{v:<24} {c:<14} {sc_c:>11.1f}% (n={cor['n']:>3}) {sc_i:>13.1f}% (n={inc['n']:>3})") + print() + + return counts + + +def main(): + rows_18 = analyze_18_models() + json.dump(rows_18, open(THIS_DIR / "self_correction_18models.json", "w"), indent=2) + counts_rescue = analyze_rescue() + print("\nSaved -> analysis/self_correction_18models.json") + + +if __name__ == "__main__": + main() diff --git a/analysis/spotcheck_clean.py b/analysis/spotcheck_clean.py new file mode 100644 index 0000000..52ddc43 --- /dev/null +++ b/analysis/spotcheck_clean.py @@ -0,0 +1,181 @@ +"""Spot-check Unicode cleaning by side-by-side comparison. + +For a stratified sample of problems, load: + - the ORIGINAL kernel_variant.solution from the backup tarball + - the CLEANED kernel_variant.solution from the current dataset +and print them side-by-side so the user can verify that the cleaner +preserved meaning. 
+
+Sampling strategy:
+  - 5 most complex (by original Unicode count) — stress test
+  - 3 medium complexity — typical case
+  - 2 low-complexity samples (least non-ASCII, but non-zero) — control
+"""
+from __future__ import annotations
+import json
+import sys
+import tarfile
+from pathlib import Path
+
+CURRENT_DIR = Path("/home/yurenh2/gap/putnam-bench-anon/dataset")
+BACKUP_TAR = sorted(Path("/home/yurenh2/gap/analysis/dataset_backups").glob(
+    "putnam-bench-anon_dataset_*.tar.gz"))[-1]
+
+
+def count_unicode(text: str) -> int:
+    return sum(1 for c in (text or "") if ord(c) > 127)
+
+
+def load_backup_problems():
+    """Yield (idx, problem_dict) from the backup tarball."""
+    with tarfile.open(BACKUP_TAR, "r:gz") as tar:
+        for member in tar.getmembers():
+            if not member.isfile() or not member.name.endswith(".json"):
+                continue
+            f = tar.extractfile(member)
+            if not f:
+                continue
+            try:
+                d = json.load(f)
+                yield d.get("index"), d
+            except Exception:
+                continue
+
+
+def main():
+    print(f"Backup tar: {BACKUP_TAR}")
+    print("Building Unicode-count index over 1051 problems ...")
+
+    # Index originals by Unicode count in kernel_variant.solution
+    by_uni_count = []  # (unicode_count, idx, solution_len)
+    backup_data = {}
+    for idx, d in load_backup_problems():
+        if not idx:
+            continue
+        backup_data[idx] = d
+        kv_sol = (d.get("variants") or {}).get("kernel_variant", {}).get("solution", "")
+        uc = count_unicode(kv_sol)
+        by_uni_count.append((uc, idx, len(kv_sol)))
+
+    by_uni_count.sort(reverse=True)
+    print(f"  loaded {len(backup_data)} problems from backup")
+
+    # Pick samples
+    samples = []
+    samples.extend([(idx, "TOP COMPLEXITY") for _, idx, _ in by_uni_count[:5]])
+    mid = len(by_uni_count) // 2
+    samples.extend([(idx, "MEDIUM COMPLEXITY")
+                    for _, idx, _ in by_uni_count[mid:mid + 3]])
+    # Bottom = least Unicode but still non-zero
+    nonzero = [t for t in by_uni_count if t[0] > 0]
+    samples.extend([(idx, "LOW COMPLEXITY")
+                    for _, idx, _ in nonzero[-2:]])
+
+    
print(f"\nSelected {len(samples)} samples:\n") + for idx, label in samples: + print(f" {label:<20} {idx}") + + print("\n" + "=" * 80) + print("SIDE-BY-SIDE SPOT-CHECK") + print("=" * 80) + + for case_idx, (idx, label) in enumerate(samples, 1): + print(f"\n{'#' * 80}") + print(f"# CASE {case_idx}/{len(samples)}: {idx} ({label})") + print(f"{'#' * 80}") + + backup_problem = backup_data.get(idx) + current_path = CURRENT_DIR / f"{idx}.json" + if not backup_problem or not current_path.exists(): + print(f" ! missing data for {idx}") + continue + current_problem = json.load(open(current_path)) + + # Compare kernel_variant.solution by default. For LOW COMPLEXITY cases + # we also show the original `solution` field if it differs. + for field_path in [("variants", "kernel_variant", "solution")]: + orig_text = backup_problem + curr_text = current_problem + for key in field_path: + orig_text = (orig_text or {}).get(key) if isinstance(orig_text, dict) else None + curr_text = (curr_text or {}).get(key) if isinstance(curr_text, dict) else None + if not orig_text and not curr_text: + continue + orig_text = orig_text or "" + curr_text = curr_text or "" + field_label = ".".join(field_path) + uni_before = count_unicode(orig_text) + uni_after = count_unicode(curr_text) + len_before = len(orig_text) + len_after = len(curr_text) + print(f"\n--- field: {field_label} ---") + print(f" before: {len_before} chars, {uni_before} non-ASCII") + print(f" after: {len_after} chars, {uni_after} non-ASCII " + f"(Δ len {len_after - len_before:+d})") + print(f"\n >>> ORIGINAL (first 600 chars) <<<") + print(" " + orig_text[:600].replace("\n", "\n ")) + print(f"\n >>> CLEANED (first 600 chars) <<<") + print(" " + curr_text[:600].replace("\n", "\n ")) + + if uni_after > 0: + print(f" !!! WARNING: cleaned output still has {uni_after} non-ASCII chars") + + # Sanity: are LaTeX braces balanced in the cleaned text? 
+ n_open = curr_text.count("{") + n_close = curr_text.count("}") + n_lparen = curr_text.count("(") + n_rparen = curr_text.count(")") + n_lbrack = curr_text.count("[") + n_rbrack = curr_text.count("]") + print(f" brace balance: {{ {n_open} | }} {n_close} " + f"( {n_lparen} | ) {n_rparen} " + f"[ {n_lbrack} | ] {n_rbrack}") + + # Final aggregate balance check across the entire cleaned dataset + print("\n" + "=" * 80) + print("AGGREGATE BRACE BALANCE CHECK (entire cleaned dataset)") + print("=" * 80) + total_diff_brace = 0 + total_diff_paren = 0 + total_diff_brack = 0 + files_with_brace_imbalance = 0 + files_with_paren_imbalance = 0 + files_with_brack_imbalance = 0 + for f in sorted(CURRENT_DIR.glob("*.json")): + d = json.load(open(f)) + # Concatenate all text fields + bag = [] + for k in ("question", "solution"): + bag.append(d.get(k) or "") + for vk, vd in (d.get("variants") or {}).items(): + if isinstance(vd, dict): + for k in ("question", "solution"): + bag.append(vd.get(k) or "") + all_text = "\n".join(bag) + diff_brace = all_text.count("{") - all_text.count("}") + diff_paren = all_text.count("(") - all_text.count(")") + diff_brack = all_text.count("[") - all_text.count("]") + if diff_brace != 0: + files_with_brace_imbalance += 1 + total_diff_brace += abs(diff_brace) + if diff_paren != 0: + files_with_paren_imbalance += 1 + total_diff_paren += abs(diff_paren) + if diff_brack != 0: + files_with_brack_imbalance += 1 + total_diff_brack += abs(diff_brack) + + print(f" files with unbalanced {{...}}: {files_with_brace_imbalance}/1051" + f" (total |Δ| = {total_diff_brace})") + print(f" files with unbalanced (...): {files_with_paren_imbalance}/1051" + f" (total |Δ| = {total_diff_paren})") + print(f" files with unbalanced [...]: {files_with_brack_imbalance}/1051" + f" (total |Δ| = {total_diff_brack})") + print() + print(" (Imbalance is not necessarily a bug — math text often legitimately") + print(" contains unbalanced delimiters in display formulas; this is just") + 
print(" an order-of-magnitude check.)") + + +if __name__ == "__main__": + main() diff --git a/analysis/structural_overlap.py b/analysis/structural_overlap.py new file mode 100644 index 0000000..284c139 --- /dev/null +++ b/analysis/structural_overlap.py @@ -0,0 +1,523 @@ +"""Stable-vs-Brittle structural overlap analysis (label-free). + +Pipeline: +1. For each (model, surface_variant) cell, load original and variant trajectories. +2. Pull the deterministic rename map from /home/yurenh2/gap/putnam-bench-anon/dataset/. +3. Canonicalize both trajectories: replace variant variables with placeholders + (via inverse rename map). Original trajectory: replace canonical variables + with the same placeholders. Both texts then live in a shared placeholder space. +4. Compute multiple non-LLM structural metrics on (orig_canonical, var_canonical): + - Token Jaccard + - Bigram Jaccard + - Equation-set Jaccard (math-block extraction) + - Prefix Jaccard (first 30% of each canonical text) +5. Stratify by group (stable vs brittle) within each (model, variant) cell. +6. Mann-Whitney U test on each metric for stable vs brittle. + +Surface variants only (rename map available). Kernel handled separately. 
+"""
+
+from __future__ import annotations
+import ast
+import json
+import re
+import os
+from pathlib import Path
+from collections import Counter, defaultdict
+from typing import Dict, List, Tuple, Optional
+
+import statistics
+
+DATASET_DIR = Path("/home/yurenh2/gap/putnam-bench-anon/dataset")
+RESULTS_DIR = Path("/home/yurenh2/gap/results_new")
+
+SURFACE_VARIANTS = ["descriptive_long", "descriptive_long_confusing",
+                    "descriptive_long_misleading", "garbled_string"]
+
+
+# ---------- I/O helpers ----------
+
+def load_problems(path: Path) -> List[dict]:
+    d = json.load(open(path))
+    return d.get("problems") or d.get("detailed_results") or []
+
+
+def find_variant_file(model_dir: Path, variant: str) -> Optional[Path]:
+    files = sorted(os.listdir(model_dir))
+    cands = [f for f in files
+             if f.endswith(f"_{variant}.json")
+             and "regraded" not in f and "comparison" not in f
+             and not f.endswith(f"_{variant}2.json")]
+    if not cands and variant == "garbled_string":
+        cands = [f for f in files if f.endswith("_gs.json")]
+    return model_dir / cands[0] if cands else None
+
+
+def load_dataset_maps() -> Dict[str, Dict[str, Dict[str, str]]]:
+    """Returns: {problem_index: {variant: {orig_var_name: variant_var_name}}}"""
+    out = {}
+    for f in sorted(DATASET_DIR.glob("*.json")):
+        d = json.load(open(f))
+        idx = d.get("index")
+        variants = d.get("variants", {})
+        cell = {}
+        for v in SURFACE_VARIANTS:
+            vd = variants.get(v, {})
+            mp_str = vd.get("map")
+            if isinstance(mp_str, str):
+                # The map is stored as a Python repr string; parse it with
+                # ast.literal_eval, which (unlike eval) cannot execute code.
+                try:
+                    mp = ast.literal_eval(mp_str)
+                    if isinstance(mp, dict):
+                        cell[v] = {str(k): str(v) for k, v in mp.items()}
+                except (ValueError, SyntaxError):
+                    pass
+            elif isinstance(mp_str, dict):
+                cell[v] = {str(k): str(v) for k, v in mp_str.items()}
+        out[idx] = cell
+    return out
+
+
+# ---------- Canonicalization ----------
+
+def canonicalize_text(text: str, var_to_placeholder: Dict[str, str]) -> str:
+    """Replace each variable name in text with 
its canonical placeholder.

+    Replacements are applied longest-first to avoid prefix collisions
+    (e.g., 'xs' before 'x'). Every name, ordinary identifiers and garbled
+    strings alike, is purely alphanumeric in this dataset, so a single
+    word-boundary regex suffices for whole-token replacement.
+    """
+    if not text:
+        return ""
+    # Sort longest-first to avoid 'al' eating into 'almondtree'
+    items = sorted(var_to_placeholder.items(), key=lambda kv: -len(kv[0]))
+    out = text
+    for var, ph in items:
+        if not var:
+            continue
+        # Word-boundary lookarounds so we only replace whole tokens.
+        # Variables in this dataset are all alphanumeric.
+        pat = r"(?<![A-Za-z0-9_])" + re.escape(var) + r"(?![A-Za-z0-9_])"
+        out = re.sub(pat, ph, out)
+    return out
+
+
+def normalize_whitespace(text: str) -> str:
+    return re.sub(r"\s+", " ", text).strip()
+
+
+# ---------- Tokenization ----------
+
+_TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")
+
+def tokens(text: str) -> List[str]:
+    return _TOKEN_RE.findall(text or "")
+
+
+def bigrams(toks: List[str]) -> List[str]:
+    return [f"{toks[i]} {toks[i+1]}" for i in range(len(toks) - 1)]
+
+
+# ---------- Math block extraction ----------
+
+_MATH_BLOCKS = [
+    re.compile(r"\$\$(.+?)\$\$", re.DOTALL),
+    re.compile(r"\\\[(.+?)\\\]", re.DOTALL),
+    re.compile(r"\$(.+?)\$", re.DOTALL),
+    re.compile(r"\\begin\{(?:equation|align|gather)\*?\}(.+?)\\end\{(?:equation|align|gather)\*?\}", re.DOTALL),
+]
+
+def extract_math_blocks(text: str, min_len: int = 8) -> List[str]:
+    found = []
+    for pat in _MATH_BLOCKS:
+        found.extend(pat.findall(text or ""))
+    # Lightweight normalization: collapse whitespace, strip
+    out = [normalize_whitespace(b) for b in found if b.strip()]
+    # Filter trivial fragments like '$n$', '$0$', '$x$' that saturate Jaccard
+    return [b for b in out if len(b) >= min_len]
+
+
+# ---------- Metrics ----------
+
+def jaccard(a: set, b: set) -> float:
+    if not a and not b:
+        return 1.0
+    return len(a & b) / max(1, len(a | b))
+
+
+def metric_token_jaccard(a: str, b: str) 
-> float: + return jaccard(set(tokens(a)), set(tokens(b))) + + +def metric_bigram_jaccard(a: str, b: str) -> float: + return jaccard(set(bigrams(tokens(a))), set(bigrams(tokens(b)))) + + +def metric_prefix_token_jaccard(a: str, b: str, frac: float = 0.3) -> float: + """Jaccard over the first frac of tokens from each side.""" + ta, tb = tokens(a), tokens(b) + na, nb = max(1, int(len(ta) * frac)), max(1, int(len(tb) * frac)) + return jaccard(set(ta[:na]), set(tb[:nb])) + + +def metric_prefix_bigram_jaccard(a: str, b: str, frac: float = 0.3) -> float: + ta, tb = tokens(a), tokens(b) + na, nb = max(1, int(len(ta) * frac)), max(1, int(len(tb) * frac)) + return jaccard(set(bigrams(ta[:na])), set(bigrams(tb[:nb]))) + + +def metric_equation_jaccard(a: str, b: str) -> float: + ea = set(extract_math_blocks(a)) + eb = set(extract_math_blocks(b)) + return jaccard(ea, eb) + + +def metric_lcp_tokens(a: str, b: str) -> int: + """Length of the longest common prefix of canonicalized token streams. + + Directly tests Codex's thesis 'early loss of structural overlap with the + model's own original reasoning under renaming'. Larger LCP -> the model + started its variant trajectory the same way it started the original. 
+ """ + ta, tb = tokens(a), tokens(b) + n = min(len(ta), len(tb)) + i = 0 + while i < n and ta[i] == tb[i]: + i += 1 + return i + + +def metric_lcp_normalized(a: str, b: str) -> float: + """LCP length normalized by the shorter trajectory length, in [0, 1].""" + ta, tb = tokens(a), tokens(b) + n = min(len(ta), len(tb)) + if n == 0: + return 0.0 + return metric_lcp_tokens(a, b) / n + + +def metric_lcp_first1k(a: str, b: str) -> float: + """LCP length capped to first-1000-token comparison, normalized to [0, 1].""" + ta, tb = tokens(a), tokens(b) + ta, tb = ta[:1000], tb[:1000] + n = min(len(ta), len(tb)) + if n == 0: + return 0.0 + i = 0 + while i < n and ta[i] == tb[i]: + i += 1 + return i / n + + +def metric_directional_coverage(a: str, b: str) -> float: + """|tokens_a ∩ tokens_b| / |tokens_a|. Length-asymmetric. + + Reads as: 'what fraction of the original's vocabulary survives in the variant?' + More robust to length differences than symmetric Jaccard. + """ + ta = set(tokens(a)) + tb = set(tokens(b)) + if not ta: + return 0.0 + return len(ta & tb) / len(ta) + + +def metric_window_token_jaccard(a: str, b: str, window: int = 600) -> float: + """Jaccard restricted to the first `window` tokens on each side.""" + ta = tokens(a)[:window] + tb = tokens(b)[:window] + return jaccard(set(ta), set(tb)) + + +def metric_window_bigram_jaccard(a: str, b: str, window: int = 600) -> float: + ta = tokens(a)[:window] + tb = tokens(b)[:window] + return jaccard(set(bigrams(ta)), set(bigrams(tb))) + + +# ---------- Stat helpers ---------- + +def bootstrap_ci_delta_median(xs: List[float], ys: List[float], + n_iter: int = 1000, seed: int = 0) -> Tuple[float, float]: + """Percentile bootstrap 95% CI on median(xs) - median(ys).""" + import random + rng = random.Random(seed) + if not xs or not ys: + return float("nan"), float("nan") + ds = [] + for _ in range(n_iter): + rs = [xs[rng.randrange(len(xs))] for _ in range(len(xs))] + rb = [ys[rng.randrange(len(ys))] for _ in range(len(ys))] + 
ds.append(statistics.median(rs) - statistics.median(rb)) + ds.sort() + lo = ds[int(0.025 * n_iter)] + hi = ds[int(0.975 * n_iter)] + return lo, hi + + +def bootstrap_ci_cohens_d(xs: List[float], ys: List[float], + n_iter: int = 1000, seed: int = 0) -> Tuple[float, float]: + import random + rng = random.Random(seed) + if len(xs) < 2 or len(ys) < 2: + return float("nan"), float("nan") + ds = [] + for _ in range(n_iter): + rs = [xs[rng.randrange(len(xs))] for _ in range(len(xs))] + rb = [ys[rng.randrange(len(ys))] for _ in range(len(ys))] + sm, bm = statistics.fmean(rs), statistics.fmean(rb) + ssd = statistics.pstdev(rs) + bsd = statistics.pstdev(rb) + pooled = (((len(rs)-1)*ssd**2 + (len(rb)-1)*bsd**2) + / max(1, len(rs)+len(rb)-2)) ** 0.5 + if pooled > 0: + ds.append((sm - bm) / pooled) + if not ds: + return float("nan"), float("nan") + ds.sort() + lo = ds[int(0.025 * len(ds))] + hi = ds[int(0.975 * len(ds))] + return lo, hi + + +def mann_whitney_u(xs: List[float], ys: List[float]) -> Tuple[float, float]: + """Returns (U, normal_approx_p_two_sided). Pure-python, no scipy. + + Used only as a screening signal — for the rebuttal we'll use scipy if + available; this is a fallback so we don't add a dependency. 
+ """ + n1, n2 = len(xs), len(ys) + if n1 == 0 or n2 == 0: + return float("nan"), float("nan") + combined = [(v, 0) for v in xs] + [(v, 1) for v in ys] + combined.sort(key=lambda t: t[0]) + # Average ranks for ties + ranks = [0.0] * len(combined) + i = 0 + while i < len(combined): + j = i + while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]: + j += 1 + avg = (i + j) / 2.0 + 1 # 1-indexed + for k in range(i, j + 1): + ranks[k] = avg + i = j + 1 + R1 = sum(ranks[k] for k in range(len(combined)) if combined[k][1] == 0) + U1 = R1 - n1 * (n1 + 1) / 2.0 + U2 = n1 * n2 - U1 + U = min(U1, U2) + # Normal approx (no tie correction) + mu = n1 * n2 / 2.0 + sd = (n1 * n2 * (n1 + n2 + 1) / 12.0) ** 0.5 + if sd == 0: + return U, float("nan") + z = (U - mu) / sd + # Two-sided p via erf approx + import math + p = math.erfc(abs(z) / math.sqrt(2)) + return U, p + + +# ---------- Cell analysis ---------- + +COLLAPSE_MIN_CHARS = 200 +COLLAPSE_RATIO = 0.25 # variant_len < ratio * orig_len => collapse + + +def is_collapse(orig_text: str, var_text: str) -> bool: + return (len(var_text) < COLLAPSE_MIN_CHARS + or len(var_text) < COLLAPSE_RATIO * max(1, len(orig_text))) + + +def analyze_cell(model_name: str, variant: str, dataset_maps: dict, + model_dir: Path) -> Optional[dict]: + orig_path = find_variant_file(model_dir, "original") + var_path = find_variant_file(model_dir, variant) + if not orig_path or not var_path: + return None + + orig_by = {p["index"]: p for p in load_problems(orig_path)} + var_by = {p["index"]: p for p in load_problems(var_path)} + + common = set(orig_by) & set(var_by) + pairs_stable_drift = [] # (orig_canon, var_canon, problem_type) — non-collapse + pairs_brittle_drift = [] # non-collapse brittle + pairs_brittle_collapse = [] # short variant text + n_stable_collapse = 0 # almost always 0 but tracked for completeness + + for idx in common: + po, pv = orig_by[idx], var_by[idx] + if po.get("correct") is not True: + continue + var_correct = 
pv.get("correct") + if var_correct is None: + continue + orig_text = (po.get("solve") or {}).get("solution") or "" + var_text = (pv.get("solve") or {}).get("solution") or "" + if not orig_text or not var_text: + continue + rmap = dataset_maps.get(idx, {}).get(variant) + if not rmap: + continue + # Canonicalize + canon_to_ph = {k: f"__V{i}__" for i, k in enumerate(rmap.keys())} + var_to_ph = {rmap[k]: canon_to_ph[k] for k in rmap} + orig_canon = canonicalize_text(orig_text, canon_to_ph) + var_canon = canonicalize_text(var_text, var_to_ph) + sample = { + "index": idx, + "problem_type": po.get("problem_type"), + "orig_canon": orig_canon, + "var_canon": var_canon, + "orig_len": len(orig_text), + "var_len": len(var_text), + } + collapse = is_collapse(orig_text, var_text) + if var_correct is True: + if collapse: + n_stable_collapse += 1 + else: + pairs_stable_drift.append(sample) + else: + if collapse: + pairs_brittle_collapse.append(sample) + else: + pairs_brittle_drift.append(sample) + + if not pairs_stable_drift or not pairs_brittle_drift: + return None + + metrics = { + "token_jaccard": metric_token_jaccard, + "bigram_jaccard": metric_bigram_jaccard, + "directional_coverage": metric_directional_coverage, + "window_token_jaccard": metric_window_token_jaccard, + "window_bigram_jaccard": metric_window_bigram_jaccard, + "equation_jaccard": metric_equation_jaccard, + } + # Headline metric for bootstrap + noise floor (the others stay descriptive) + HEADLINE = "token_jaccard" + + # Pre-tokenize once per pair to amortize cost (used by token/bigram/window metrics). 
+ for p in pairs_stable_drift + pairs_brittle_drift: + p["_otok"] = tokens(p["orig_canon"]) + p["_vtok"] = tokens(p["var_canon"]) + p["_oset"] = set(p["_otok"]) + p["_vset"] = set(p["_vtok"]) + + def fast_token_jaccard(p): + a, b = p["_oset"], p["_vset"] + if not a and not b: + return 1.0 + return len(a & b) / max(1, len(a | b)) + + def fast_token_jaccard_pair(pa, pb): + a, b = pa["_oset"], pb["_vset"] + if not a and not b: + return 1.0 + return len(a & b) / max(1, len(a | b)) + + out = { + "model": model_name, + "variant": variant, + "n_stable_drift": len(pairs_stable_drift), + "n_brittle_drift": len(pairs_brittle_drift), + "n_stable_collapse": n_stable_collapse, + "n_brittle_collapse": len(pairs_brittle_collapse), + "brittle_collapse_rate": (len(pairs_brittle_collapse) + / max(1, len(pairs_brittle_collapse) + len(pairs_brittle_drift))), + "metrics": {}, + } + # Compute all descriptive metrics (one pass per pair, no bootstrap) + for mname, mfn in metrics.items(): + s_vals = [mfn(p["orig_canon"], p["var_canon"]) for p in pairs_stable_drift] + b_vals = [mfn(p["orig_canon"], p["var_canon"]) for p in pairs_brittle_drift] + U, p = mann_whitney_u(s_vals, b_vals) + sm, bm = statistics.fmean(s_vals), statistics.fmean(b_vals) + ssd = statistics.pstdev(s_vals) if len(s_vals) > 1 else 0 + bsd = statistics.pstdev(b_vals) if len(b_vals) > 1 else 0 + pooled = (((len(s_vals)-1)*ssd**2 + (len(b_vals)-1)*bsd**2) + / max(1, len(s_vals)+len(b_vals)-2)) ** 0.5 + d = (sm - bm) / pooled if pooled > 0 else 0.0 + out["metrics"][mname] = { + "stable_median": statistics.median(s_vals), + "stable_mean": sm, + "brittle_median": statistics.median(b_vals), + "brittle_mean": bm, + "delta_median": statistics.median(s_vals) - statistics.median(b_vals), + "delta_mean": sm - bm, + "cohens_d": d, + "U": U, + "p_two_sided": p, + } + + # Bootstrap + noise floor only on headline metric + s_vals = [fast_token_jaccard(p) for p in pairs_stable_drift] + b_vals = [fast_token_jaccard(p) for p in 
pairs_brittle_drift] + ci_lo, ci_hi = bootstrap_ci_delta_median(s_vals, b_vals, n_iter=400) + d_lo, d_hi = bootstrap_ci_cohens_d(s_vals, b_vals, n_iter=400) + out["metrics"][HEADLINE]["delta_median_ci"] = [ci_lo, ci_hi] + out["metrics"][HEADLINE]["cohens_d_ci"] = [d_lo, d_hi] + + # Random-pairing noise floor for headline: pair stable orig with random other-problem variant + import random as _r + rng = _r.Random(42) + nf_vals = [] + n = len(pairs_stable_drift) + if n >= 2: + for _ in range(min(400, n * (n - 1))): + i = rng.randrange(n) + j = rng.randrange(n) + while j == i: + j = rng.randrange(n) + nf_vals.append(fast_token_jaccard_pair(pairs_stable_drift[i], + pairs_stable_drift[j])) + out["metrics"][HEADLINE]["noise_floor_median"] = ( + statistics.median(nf_vals) if nf_vals else None) + out["metrics"][HEADLINE]["noise_floor_mean"] = ( + statistics.fmean(nf_vals) if nf_vals else None) + out["metrics"][HEADLINE]["noise_floor_n"] = len(nf_vals) + return out + + +def main(): + print("Loading dataset rename maps ...", flush=True) + dataset_maps = load_dataset_maps() + print(f" loaded {len(dataset_maps)} problems", flush=True) + + # Multi-cell sweep across all models × 4 surface variants + # Run all 18 models — non-LLM, fast. 
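The per-cell table this sweep prints is keyed on the headline token-Jaccard metric. A minimal self-contained sketch of that metric, using the same tokenizer pattern as `structural_overlap.py` on toy strings (not dataset text):

```python
import re

# Same token pattern as structural_overlap.py: identifiers, digit runs,
# or single punctuation characters.
_TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")

def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = set(_TOKEN_RE.findall(a)), set(_TOKEN_RE.findall(b))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / max(1, len(ta | tb))

print(token_jaccard("let __V0__ = 2", "let __V0__ = 2"))  # identical -> 1.0
print(token_jaccard("a b c d", "a b x y"))                # 2 shared / 6 total
```

Stable and brittle pairs each get this score on their canonicalized (original, variant) trajectories; the sweep then compares the two distributions per cell.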
+    all_models = sorted([d.name for d in RESULTS_DIR.iterdir() if d.is_dir()])
+    print(f"Models: {all_models}")
+    all_results = []
+
+    print(f"\n{'Cell':<46} {'nSd':>4} {'nBd':>4} {'col%':>5} "
+          f"{'sMed':>6} {'bMed':>6} {'nfMed':>6} "
+          f"{'d':>6} {'d95CI':>14} {'p':>9}")
+    print("-" * 122)
+
+    for m in all_models:
+        for v in SURFACE_VARIANTS:
+            mdir = RESULTS_DIR / m
+            if not mdir.exists():
+                continue
+            res = analyze_cell(m, v, dataset_maps, mdir)
+            if res is None:
+                continue
+            all_results.append(res)
+            md = res["metrics"]["token_jaccard"]
+            label = f"{m} / {v}"
+            ci_lo, ci_hi = md["cohens_d_ci"]
+            ci_str = f"[{ci_lo:+.2f}, {ci_hi:+.2f}]"
+            # noise_floor_median is None when a cell has fewer than 2 stable
+            # pairs; guard so one sparse cell cannot crash the whole sweep.
+            nf = md["noise_floor_median"]
+            nf_str = f"{nf:>6.3f}" if nf is not None else "   n/a"
+            print(f"{label:<46} {res['n_stable_drift']:>4} {res['n_brittle_drift']:>4} "
+                  f"{res['brittle_collapse_rate']*100:>4.0f}% "
+                  f"{md['stable_median']:>6.3f} {md['brittle_median']:>6.3f} "
+                  f"{nf_str} "
+                  f"{md['cohens_d']:>+6.2f} {ci_str:>14} {md['p_two_sided']:>9.1e}")
+
+    out_path = Path("/home/yurenh2/gap/analysis/structural_overlap_results.json")
+    json.dump(all_results, open(out_path, "w"), indent=2)
+    print(f"\nSaved -> {out_path} ({len(all_results)} cells)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/analysis/topic_problemtype_interaction.py b/analysis/topic_problemtype_interaction.py
new file mode 100644
index 0000000..405b33a
--- /dev/null
+++ b/analysis/topic_problemtype_interaction.py
@@ -0,0 +1,112 @@
+"""KV fragility broken down by Topic × Problem-type (proof vs calculation)."""
+from __future__ import annotations
+import json
+import sys
+import statistics
+from pathlib import Path
+from collections import defaultdict
+
+THIS_DIR = Path(__file__).resolve().parent
+sys.path.insert(0, str(THIS_DIR))
+from structural_overlap import find_variant_file, load_problems, RESULTS_DIR, SURFACE_VARIANTS
+
+DATASET_DIR = Path("/home/yurenh2/gap/putnam-bench-anon/dataset")
+
+
+def load_metadata():
+    out = {}
+    for f in sorted(DATASET_DIR.glob("*.json")):
+        d = json.load(open(f))
+        idx = 
d.get("index") + if not idx: continue + out[idx] = { + "tag": d.get("tag"), + "problem_type": d.get("problem_type"), + } + return out + + +def main(): + metadata = load_metadata() + base = RESULTS_DIR + models = sorted([d.name for d in base.iterdir() if d.is_dir()]) + + # cells[(topic, ptype, model, variant)] = (n, n_correct) + cells = defaultdict(lambda: [0, 0]) + for m in models: + mdir = base / m + for v in ["original"] + SURFACE_VARIANTS + ["kernel_variant"]: + vp = find_variant_file(mdir, v) + if not vp: continue + for p in load_problems(vp): + idx = p.get("index") + correct = p.get("correct") + if idx is None or correct is None: continue + md = metadata.get(idx, {}) + tag = md.get("tag") + ptype = md.get("problem_type") + if not tag or not ptype: continue + tags = tag if isinstance(tag, list) else [tag] + for t in tags: + if t not in ["ALG", "ANA", "NT", "COMB", "GEO"]: continue + cells[(t, ptype, m, v)][0] += 1 + if correct: cells[(t, ptype, m, v)][1] += 1 + + print("=" * 80) + print("ACCURACY BY TOPIC × PROBLEM-TYPE × VARIANT (mean across 18 models)") + print("=" * 80) + print() + + for ptype in ["proof", "calculation"]: + print(f"\n--- {ptype.upper()} ---\n") + print(f"{'Topic':<6}", end="") + for v in ["original", "garbled_string", "kernel_variant"]: + short = {"original":"orig","garbled_string":"GS","kernel_variant":"KV"}[v] + print(f" {short:>6}", end="") + print(f" {'Δ_GS':>7} {'Δ_KV':>7}") + print("-" * 50) + for t in ["ALG", "ANA", "NT", "COMB", "GEO"]: + orig_rates = [] + gs_rates = [] + kv_rates = [] + for m in models: + no, co = cells.get((t, ptype, m, "original"), [0, 0]) + ng, cg = cells.get((t, ptype, m, "garbled_string"), [0, 0]) + nk, ck = cells.get((t, ptype, m, "kernel_variant"), [0, 0]) + if no >= 5 and ng >= 5 and nk >= 5: + orig_rates.append(co / no) + gs_rates.append(cg / ng) + kv_rates.append(ck / nk) + if not orig_rates: continue + mo = statistics.fmean(orig_rates) * 100 + mg = statistics.fmean(gs_rates) * 100 + mk = 
statistics.fmean(kv_rates) * 100 + print(f"{t:<6} {mo:>5.1f}% {mg:>5.1f}% {mk:>5.1f}% {mg-mo:>+5.1f}pp {mk-mo:>+5.1f}pp") + + print("\n\n=== KEY DIFFERENTIAL: Δ KV by Topic for proof vs calculation ===\n") + print(f"{'Topic':<6} {'proof Δ':>10} {'calc Δ':>10} {'(calc - proof)':>16}") + print("-" * 50) + for t in ["ALG", "ANA", "NT", "COMB", "GEO"]: + deltas = {} + for ptype in ["proof", "calculation"]: + orig_rates = [] + kv_rates = [] + for m in models: + no, co = cells.get((t, ptype, m, "original"), [0, 0]) + nk, ck = cells.get((t, ptype, m, "kernel_variant"), [0, 0]) + if no >= 5 and nk >= 5: + orig_rates.append(co / no) + kv_rates.append(ck / nk) + if orig_rates: + deltas[ptype] = (statistics.fmean(kv_rates) - statistics.fmean(orig_rates)) * 100 + if "proof" in deltas and "calculation" in deltas: + diff = deltas["calculation"] - deltas["proof"] + print(f"{t:<6} {deltas['proof']:>+9.1f}pp {deltas['calculation']:>+9.1f}pp {diff:>+15.1f}pp") + elif "proof" in deltas: + print(f"{t:<6} {deltas['proof']:>+9.1f}pp {'-':>10} {'-':>16}") + elif "calculation" in deltas: + print(f"{t:<6} {'-':>10} {deltas['calculation']:>+9.1f}pp {'-':>16}") + + +if __name__ == "__main__": + main() diff --git a/analysis/unicode_audit.py b/analysis/unicode_audit.py new file mode 100644 index 0000000..afe5679 --- /dev/null +++ b/analysis/unicode_audit.py @@ -0,0 +1,238 @@ +"""Unicode audit for PutnamGAP dataset. + +Scans all JSON files in the dataset, finds all non-ASCII characters in text +fields (question, solution across all variants), and reports: + +1. How many files contain Unicode +2. Top Unicode characters by total frequency with suggested LaTeX replacements +3. Which fields are most affected +4. Per-file tallies +5. Samples of lines showing each unusual character in context +6. A machine-readable JSON report for downstream cleaning + +Does NOT modify any file. Read-only audit. 
+""" +from __future__ import annotations +import json +import sys +import unicodedata +from pathlib import Path +from collections import defaultdict, Counter + +# Both copies of the dataset +DIRS = [ + Path("/home/yurenh2/gap/putnam-bench-anon/dataset"), + Path("/home/yurenh2/gap/putnamsup/PutnamGAP"), +] + +# Text-bearing fields we care about +TOP_LEVEL_TEXT_FIELDS = ["question", "solution"] +VARIANT_TEXT_FIELDS = ["question", "solution"] +VARIANT_KEYS = [ + "descriptive_long", + "descriptive_long_confusing", + "descriptive_long_misleading", + "garbled_string", + "kernel_variant", + "original_kernel_variant", +] + +# Suggested LaTeX replacements for common math Unicode. (Informational — the +# audit does not apply these.) Each entry is (unicode_char, latex_suggestion). +SUGGESTED_LATEX = { + # Greek lower case + "α": r"\alpha", "β": r"\beta", "γ": r"\gamma", "δ": r"\delta", + "ε": r"\varepsilon", "ζ": r"\zeta", "η": r"\eta", "θ": r"\theta", + "ι": r"\iota", "κ": r"\kappa", "λ": r"\lambda", "μ": r"\mu", + "ν": r"\nu", "ξ": r"\xi", "π": r"\pi", "ρ": r"\rho", "σ": r"\sigma", + "τ": r"\tau", "υ": r"\upsilon", "φ": r"\varphi", "χ": r"\chi", + "ψ": r"\psi", "ω": r"\omega", + # Greek upper case + "Α": "A", "Β": "B", "Γ": r"\Gamma", "Δ": r"\Delta", "Ε": "E", + "Ζ": "Z", "Η": "H", "Θ": r"\Theta", "Λ": r"\Lambda", "Ξ": r"\Xi", + "Π": r"\Pi", "Σ": r"\Sigma", "Φ": r"\Phi", "Ψ": r"\Psi", + "Ω": r"\Omega", + # Math operators & relations + "≤": r"\leq", "≥": r"\geq", "≠": r"\neq", "≈": r"\approx", + "≡": r"\equiv", "±": r"\pm", "∓": r"\mp", "×": r"\times", + "÷": r"\div", "·": r"\cdot", "∙": r"\cdot", + "∞": r"\infty", "∂": r"\partial", "∇": r"\nabla", "∆": r"\Delta", + "∑": r"\sum", "∏": r"\prod", "∫": r"\int", "√": r"\sqrt{}", + "∮": r"\oint", "∴": r"\therefore", "∵": r"\because", + "∈": r"\in", "∉": r"\notin", "⊂": r"\subset", "⊆": r"\subseteq", + "⊃": r"\supset", "⊇": r"\supseteq", "∪": r"\cup", "∩": r"\cap", + "∧": r"\land", "∨": r"\lor", "¬": r"\neg", + "→": r"\to", "←": 
r"\leftarrow", "↔": r"\leftrightarrow", + "⇒": r"\Rightarrow", "⇐": r"\Leftarrow", "⇔": r"\Leftrightarrow", + "⟨": r"\langle", "⟩": r"\rangle", "⌊": r"\lfloor", "⌋": r"\rfloor", + "⌈": r"\lceil", "⌉": r"\rceil", + "∅": r"\emptyset", "ℝ": r"\mathbb{R}", "ℂ": r"\mathbb{C}", + "ℕ": r"\mathbb{N}", "ℤ": r"\mathbb{Z}", "ℚ": r"\mathbb{Q}", + # Subscripts / superscripts (common ones only) + "₀": "_0", "₁": "_1", "₂": "_2", "₃": "_3", "₄": "_4", "₅": "_5", + "₆": "_6", "₇": "_7", "₈": "_8", "₉": "_9", + "⁰": "^0", "¹": "^1", "²": "^2", "³": "^3", "⁴": "^4", "⁵": "^5", + "⁶": "^6", "⁷": "^7", "⁸": "^8", "⁹": "^9", + "ₐ": "_a", "ᵢ": "_i", "ⱼ": "_j", "ₖ": "_k", "ₙ": "_n", + # Fractions + "½": r"\frac{1}{2}", "⅓": r"\frac{1}{3}", "⅔": r"\frac{2}{3}", + "¼": r"\frac{1}{4}", "¾": r"\frac{3}{4}", + # Punctuation / whitespace + "—": "---", "–": "--", "…": r"\ldots", + "‘": "`", "’": "'", "“": "``", "”": "''", + "°": r"^\circ", + "\u00A0": " (nbsp)", # non-breaking space + "\u2009": " (thin space)", + "\u200b": " (zero-width space)", + "\u2026": r"\ldots", + "\u2212": "-", # Unicode minus vs hyphen +} + + +def is_non_ascii(ch: str) -> bool: + return ord(ch) > 127 + + +def extract_text_fields(problem: dict): + """Yield (field_path, text) for every text-bearing field in a problem.""" + idx = problem.get("index", "?") + for k in TOP_LEVEL_TEXT_FIELDS: + v = problem.get(k) + if isinstance(v, str): + yield f"{idx}:{k}", v + for vk in VARIANT_KEYS: + vd = (problem.get("variants") or {}).get(vk) + if not isinstance(vd, dict): + continue + for k in VARIANT_TEXT_FIELDS: + v = vd.get(k) + if isinstance(v, str): + yield f"{idx}:variants.{vk}.{k}", v + + +def audit_dir(dataset_dir: Path, label: str): + print(f"\n{'=' * 76}") + print(f"Auditing {label}: {dataset_dir}") + print(f"{'=' * 76}") + + files = sorted(dataset_dir.glob("*.json")) + print(f"Files: {len(files)}") + + char_counter = Counter() # unicode char -> total occurrences + field_char_counter = defaultdict(Counter) # field_name -> 
Counter + files_with_unicode = set() # set of problem indices + per_field_counts = Counter() # {question, solution, variants.DL.question, ...} -> n files with unicode + examples = defaultdict(list) # char -> list of (context, path) + total_chars = 0 + total_unicode = 0 + + for f in files: + try: + d = json.load(open(f)) + except Exception as e: + print(f" ! {f.name}: JSON parse error: {e}") + continue + file_had_unicode = False + for path, text in extract_text_fields(d): + if not text: + continue + total_chars += len(text) + nas = [c for c in text if is_non_ascii(c)] + if not nas: + continue + file_had_unicode = True + total_unicode += len(nas) + # tally + for c in nas: + char_counter[c] += 1 + # short field label (strip problem index prefix) + short = path.split(":", 1)[1] + field_char_counter[short][c] += 1 + per_field_counts[short] += 1 + # collect up to 3 examples per char with ±20 char context + if len(examples[c]) < 3: + idx = text.find(c) + start = max(0, idx - 25) + end = min(len(text), idx + 25) + ctx = text[start:end].replace("\n", " ") + examples[c].append((ctx, path)) + if file_had_unicode: + files_with_unicode.add(d.get("index", f.name)) + + # Report + print(f"\nTotal characters scanned: {total_chars:,}") + print(f"Non-ASCII characters: {total_unicode:,} ({total_unicode/total_chars*100:.2f}%)") + print(f"Files with any Unicode: {len(files_with_unicode)}/{len(files)} " + f"({len(files_with_unicode)/len(files)*100:.1f}%)") + print(f"Distinct Unicode code points: {len(char_counter)}") + + print(f"\n--- Top 40 Unicode characters by frequency ---") + print(f"{'char':<6} {'hex':<8} {'count':>8} name / suggested LaTeX") + print("-" * 76) + for c, n in char_counter.most_common(40): + name = unicodedata.name(c, "?") + hex_val = f"U+{ord(c):04X}" + suggestion = SUGGESTED_LATEX.get(c, "") + display_c = c if c.isprintable() and ord(c) > 0x20 else repr(c) + print(f"{display_c:<6} {hex_val:<8} {n:>8} {name[:45]:<45} {suggestion}") + + # Per-field breakdown + 
print(f"\n--- Unicode per field (top 15 fields with most Unicode) ---") + print(f"{'field':<50} {'total unicode':>15}") + print("-" * 70) + for field, cnt in Counter({f: sum(c.values()) for f, c in field_char_counter.items()}).most_common(15): + print(f"{field:<50} {cnt:>15}") + + # Examples for top 10 chars + print(f"\n--- Example contexts for top 10 Unicode chars ---") + for c, n in char_counter.most_common(10): + name = unicodedata.name(c, "?") + display_c = c if c.isprintable() and ord(c) > 0x20 else repr(c) + print(f"\n {display_c} (U+{ord(c):04X}, {name}, n={n}):") + for ctx, path in examples[c][:2]: + print(f" [{path}]") + print(f" …{ctx}…") + + # Machine-readable summary + summary = { + "dataset_dir": str(dataset_dir), + "n_files": len(files), + "n_files_with_unicode": len(files_with_unicode), + "pct_files_with_unicode": 100 * len(files_with_unicode) / max(1, len(files)), + "total_chars": total_chars, + "total_unicode": total_unicode, + "distinct_codepoints": len(char_counter), + "top_chars": [ + {"char": c, "codepoint": f"U+{ord(c):04X}", + "name": unicodedata.name(c, "?"), + "count": n, + "suggested_latex": SUGGESTED_LATEX.get(c, ""), + "examples": [{"path": path, "context": ctx} + for ctx, path in examples[c][:3]]} + for c, n in char_counter.most_common(80) + ], + "per_field_unicode_counts": dict( + Counter({f: sum(c.values()) for f, c in field_char_counter.items()}) + .most_common(30)), + "files_with_unicode_indices": sorted(files_with_unicode), + } + return summary + + +def main(): + all_summaries = [] + for d in DIRS: + if d.exists(): + s = audit_dir(d, d.name) + s["label"] = d.name + all_summaries.append(s) + else: + print(f" (skipping missing dir {d})") + + out_path = Path("/home/yurenh2/gap/analysis/unicode_audit.json") + json.dump(all_summaries, open(out_path, "w"), indent=2, ensure_ascii=False) + print(f"\n\nSaved machine-readable summary -> {out_path}") + + +if __name__ == "__main__": + main() diff --git a/analysis/unicode_clean.py 
b/analysis/unicode_clean.py new file mode 100644 index 0000000..cea3cbe --- /dev/null +++ b/analysis/unicode_clean.py @@ -0,0 +1,729 @@ +"""Unicode -> LaTeX cleaner for PutnamGAP dataset (v2). + +Improvements over v1: + - Pre-normalize via NFKD then strip combining diacritics so accented + letters collapse to their ASCII base. + - Group adjacent subscript/superscript runs into {...}: x_1_0 -> x_{10}, + x^2^3 -> x^{23}. + - Wrap the argument of radical commands: \\sqrt-followed-by-X -> \\sqrt{X} + where X is either an identifier/number run or a balanced paren/bracket + group or a single \\-command (optionally followed by {...} arguments). + - Explicit replacements for symbols that previously fell through: + star, blacksquare/QED, fraction slash, dagger, etc. + - Deletes lone combining diacritics and decorative box-drawing characters. + +Operates IN PLACE on both dataset copies. Backup in a tarball first. +""" +from __future__ import annotations +import json +import re +import sys +import unicodedata +from pathlib import Path +from collections import Counter + +DIRS = [ + Path("/home/yurenh2/gap/putnam-bench-anon/dataset"), + Path("/home/yurenh2/gap/putnamsup/PutnamGAP"), +] + +TOP_LEVEL_TEXT_FIELDS = ["question", "solution"] +VARIANT_TEXT_FIELDS = ["question", "solution"] +VARIANT_KEYS = [ + "descriptive_long", + "descriptive_long_confusing", + "descriptive_long_misleading", + "garbled_string", + "kernel_variant", + "original_kernel_variant", +] + + +# Sentinels placed during char substitution, resolved in a later pass that +# can look at the following characters to extract the radical argument. 
+SENT_SQRT = "\x01SQRT\x01" +SENT_CBRT = "\x01CBRT\x01" +SENT_FRT = "\x01FRT\x01" + +REPLACEMENTS: dict = { + # Whitespace -> normal space + "\u00A0": " ", "\u2002": " ", "\u2003": " ", "\u2004": " ", + "\u2005": " ", "\u2006": " ", "\u2007": " ", "\u2008": " ", + "\u2009": " ", "\u200A": " ", "\u200B": "", "\u200C": "", + "\u200D": "", "\u202F": " ", "\u205F": " ", "\u3000": " ", + "\uFEFF": "", + + # Dashes / hyphens + # NOTE: in this dataset (kernel-variant LLM-generated math text) the + # EN DASH is used pervasively as a math minus sign, not a typographic + # en-dash, so we map it to a single hyphen-minus rather than the + # typographic `--`. The EM DASH stays as `---` (prose convention). + "\u2010": "-", "\u2011": "-", + "\u2012": "-", # FIGURE DASH + "\u2013": "-", # EN DASH (was `--`; common usage here is math minus) + "\u2014": "---", # EM DASH (typographic prose break) + "\u2015": "---", # HORIZONTAL BAR + "\u2212": "-", + + # Quotation marks + "\u2018": "`", "\u2019": "'", "\u201A": ",", "\u201B": "`", + "\u201C": "``", "\u201D": "''", "\u201E": ",,", + "\u00AB": "<<", "\u00BB": ">>", + + # Punctuation / miscellany + "\u2022": "*", + "\u2023": "*", + "\u2027": ".", + "\u2026": r"\ldots", + "\u00B7": r"\cdot", + "\u00B0": r"^\circ", + "\u2032": "'", "\u2033": "''", "\u2034": "'''", "\u2035": "`", + "\u2605": r"\star", + "\u2606": r"\star", + "\u25A0": r"\blacksquare", + "\u25A1": r"\square", + "\u220E": r"\blacksquare", + "\u2020": r"\dagger", + "\u2021": r"\ddagger", + "\u2044": "/", + + # Sub/super digits + "\u2070": "^0", "\u00B9": "^1", "\u00B2": "^2", "\u00B3": "^3", + "\u2074": "^4", "\u2075": "^5", "\u2076": "^6", "\u2077": "^7", + "\u2078": "^8", "\u2079": "^9", + "\u207A": "^+", "\u207B": "^-", "\u207C": "^=", "\u207D": "^(", "\u207E": "^)", + "\u2080": "_0", "\u2081": "_1", "\u2082": "_2", "\u2083": "_3", + "\u2084": "_4", "\u2085": "_5", "\u2086": "_6", "\u2087": "_7", + "\u2088": "_8", "\u2089": "_9", + "\u208A": "_+", "\u208B": "_-", "\u208C": 
"_=", "\u208D": "_(", "\u208E": "_)", + + # Latin sub/super letters + "\u2090": "_a", "\u2091": "_e", "\u2092": "_o", "\u2093": "_x", + "\u2095": "_h", "\u2096": "_k", "\u2097": "_l", "\u2098": "_m", + "\u2099": "_n", "\u209A": "_p", "\u209B": "_s", "\u209C": "_t", + "\u2C7C": "_j", # LATIN SUBSCRIPT SMALL LETTER J + "\u1D30": "^D", "\u1D31": "^E", "\u1D33": "^G", "\u1D34": "^H", + "\u1D35": "^I", "\u1D36": "^J", "\u1D37": "^K", "\u1D38": "^L", + "\u1D39": "^M", "\u1D3A": "^N", "\u1D3C": "^O", "\u1D3E": "^P", + "\u1D3F": "^R", "\u1D40": "^T", "\u1D41": "^U", "\u1D42": "^W", + "\u1D43": "^a", "\u1D47": "^b", "\u1D48": "^d", "\u1D49": "^e", + "\u1D4D": "^g", "\u1D4F": "^k", "\u1D50": "^m", "\u1D52": "^o", + "\u1D56": "^p", "\u1D57": "^t", "\u1D58": "^u", "\u1D5B": "^v", + "\u1D62": "_i", "\u1D63": "_r", "\u1D64": "_u", "\u1D65": "_v", + "\u2071": "^i", "\u207F": "^n", + + # Greek lower case + "\u03B1": r"\alpha", "\u03B2": r"\beta", "\u03B3": r"\gamma", + "\u03B4": r"\delta", "\u03B5": r"\varepsilon", "\u03B6": r"\zeta", + "\u03B7": r"\eta", "\u03B8": r"\theta", "\u03B9": r"\iota", + "\u03BA": r"\kappa", "\u03BB": r"\lambda", "\u03BC": r"\mu", + "\u03BD": r"\nu", "\u03BE": r"\xi", "\u03BF": "o", + "\u03C0": r"\pi", "\u03C1": r"\rho", "\u03C2": r"\varsigma", + "\u03C3": r"\sigma", "\u03C4": r"\tau", "\u03C5": r"\upsilon", + "\u03C6": r"\varphi", "\u03C7": r"\chi", "\u03C8": r"\psi", + "\u03C9": r"\omega", + "\u03D5": r"\phi", "\u03D1": r"\vartheta", "\u03D6": r"\varpi", + "\u03F1": r"\varrho", "\u03F5": r"\epsilon", + # Greek upper case + "\u0391": "A", "\u0392": "B", "\u0393": r"\Gamma", + "\u0394": r"\Delta", "\u0395": "E", "\u0396": "Z", + "\u0397": "H", "\u0398": r"\Theta", "\u0399": "I", + "\u039A": "K", "\u039B": r"\Lambda", "\u039C": "M", + "\u039D": "N", "\u039E": r"\Xi", "\u039F": "O", + "\u03A0": r"\Pi", "\u03A1": "P", "\u03A3": r"\Sigma", + "\u03A4": "T", "\u03A5": r"\Upsilon", "\u03A6": r"\Phi", + "\u03A7": "X", "\u03A8": r"\Psi", "\u03A9": r"\Omega", + + 
# Math operators / relations + "\u2200": r"\forall", "\u2203": r"\exists", "\u2204": r"\nexists", + "\u2205": r"\emptyset", + "\u2208": r"\in", "\u2209": r"\notin", "\u220B": r"\ni", + "\u220F": r"\prod", "\u2210": r"\coprod", "\u2211": r"\sum", + "\u2213": r"\mp", "\u00B1": r"\pm", + "\u2214": r"\dotplus", + "\u2217": "*", "\u2218": r"\circ", "\u2219": r"\cdot", + "\u221D": r"\propto", + "\u221E": r"\infty", + "\u2220": r"\angle", "\u2221": r"\measuredangle", + "\u2225": r"\parallel", "\u2226": r"\nparallel", + "\u2227": r"\land", "\u2228": r"\lor", + "\u2229": r"\cap", "\u222A": r"\cup", + "\u222B": r"\int", "\u222C": r"\iint", "\u222D": r"\iiint", + "\u222E": r"\oint", "\u222F": r"\oiint", + "\u2234": r"\therefore", "\u2235": r"\because", + "\u2236": ":", "\u2237": "::", + "\u223C": r"\sim", "\u2243": r"\simeq", "\u2245": r"\cong", + "\u2248": r"\approx", "\u224D": r"\asymp", + "\u2250": r"\doteq", + "\u2260": r"\neq", "\u2261": r"\equiv", "\u2262": r"\not\equiv", + "\u2264": r"\leq", "\u2265": r"\geq", + "\u2266": r"\leqq", "\u2267": r"\geqq", + "\u226A": r"\ll", "\u226B": r"\gg", + "\u2270": r"\not\leq", "\u2271": r"\not\geq", + "\u2282": r"\subset", "\u2283": r"\supset", + "\u2284": r"\not\subset", "\u2285": r"\not\supset", + "\u2286": r"\subseteq", "\u2287": r"\supseteq", + "\u2288": r"\not\subseteq", "\u2289": r"\not\supseteq", + "\u228A": r"\subsetneq", "\u228B": r"\supsetneq", + "\u2295": r"\oplus", "\u2296": r"\ominus", + "\u2297": r"\otimes", "\u2298": r"\oslash", "\u2299": r"\odot", + "\u22A2": r"\vdash", "\u22A3": r"\dashv", + "\u22A4": r"\top", "\u22A5": r"\bot", + "\u22A8": r"\models", + "\u22C0": r"\bigwedge", "\u22C1": r"\bigvee", + "\u22C2": r"\bigcap", "\u22C3": r"\bigcup", + "\u22C5": r"\cdot", "\u22C6": r"\star", + "\u22EE": r"\vdots", "\u22EF": r"\cdots", + "\u22F1": r"\ddots", + + # Arrows + "\u2190": r"\leftarrow", "\u2192": r"\to", + "\u2191": r"\uparrow", "\u2193": r"\downarrow", + "\u2194": r"\leftrightarrow", "\u2195": r"\updownarrow", 
+ "\u21A0": r"\twoheadrightarrow", + "\u21A6": r"\mapsto", + "\u21D0": r"\Leftarrow", "\u21D2": r"\Rightarrow", + "\u21D1": r"\Uparrow", "\u21D3": r"\Downarrow", + "\u21D4": r"\Leftrightarrow", + "\u27F6": r"\longrightarrow", "\u27F5": r"\longleftarrow", + "\u27F9": r"\Longrightarrow", "\u27F8": r"\Longleftarrow", + "\u27FA": r"\Longleftrightarrow", + + # Delimiters + "\u2016": r"\|", + "\u2308": r"\lceil", "\u2309": r"\rceil", + "\u230A": r"\lfloor", "\u230B": r"\rfloor", + "\u27E8": r"\langle", "\u27E9": r"\rangle", + "\u27EA": r"\llangle", "\u27EB": r"\rrangle", + + # Blackboard / script letters + "\u2102": r"\mathbb{C}", "\u210D": r"\mathbb{H}", + "\u2115": r"\mathbb{N}", "\u2119": r"\mathbb{P}", + "\u211A": r"\mathbb{Q}", "\u211D": r"\mathbb{R}", + "\u2124": r"\mathbb{Z}", + "\u2113": r"\ell", "\u210F": r"\hbar", + "\u2202": r"\partial", "\u2207": r"\nabla", "\u2118": r"\wp", + "\u2133": r"\mathcal{M}", "\u2112": r"\mathcal{L}", + "\u211B": r"\mathcal{R}", "\u2110": r"\mathcal{I}", + "\u2130": r"\mathcal{E}", "\u2132": "F", + + # Fractions with precomposed forms + "\u00BC": r"\frac{1}{4}", "\u00BD": r"\frac{1}{2}", "\u00BE": r"\frac{3}{4}", + "\u2153": r"\frac{1}{3}", "\u2154": r"\frac{2}{3}", + "\u2155": r"\frac{1}{5}", "\u2156": r"\frac{2}{5}", + "\u2157": r"\frac{3}{5}", "\u2158": r"\frac{4}{5}", + "\u2159": r"\frac{1}{6}", "\u215A": r"\frac{5}{6}", + "\u215B": r"\frac{1}{8}", "\u215C": r"\frac{3}{8}", + "\u215D": r"\frac{5}{8}", "\u215E": r"\frac{7}{8}", + + # Multiplication / division + "\u00D7": r"\times", "\u00F7": r"\div", + + # Misc + "\u00A7": r"\S", + "\u00B6": r"\P", + "\u00A9": "(c)", "\u00AE": "(R)", "\u2122": "(TM)", + "\u00A3": r"\pounds", "\u20AC": "EUR", + "\u00B5": r"\mu", + + # Additional math symbols + "\u2216": r"\setminus", + "\u2223": r"\mid", + "\u2224": r"\nmid", + "\u2225": r"\parallel", # duplicate of above, safe + "\u2226": r"\nparallel", + "\u22BB": r"\veebar", + "\u22BC": r"\barwedge", + "\u2238": r"\dot{-}", + "\u22C8": 
r"\bowtie", + "\u22CE": r"\curlyvee", + "\u22CF": r"\curlywedge", + + # Perp and triangle family + "\u27C2": r"\perp", + "\u22A5": r"\bot", # already present but safe + "\u25B3": r"\triangle", + "\u25B4": r"\blacktriangle", + "\u25BD": r"\triangledown", + "\u25BE": r"\blacktriangledown", + "\u25C1": r"\triangleleft", + "\u25C2": r"\blacktriangleleft", + "\u25B7": r"\triangleright", + "\u25B8": r"\blacktriangleright", + + # Square / box operators + "\u2293": r"\sqcap", + "\u2294": r"\sqcup", + "\u22A1": r"\boxdot", + "\u229E": r"\boxplus", + "\u229F": r"\boxminus", + "\u22A0": r"\boxtimes", + + # Preceq / succeq family + "\u227A": r"\prec", + "\u227B": r"\succ", + "\u227C": r"\preceq", + "\u227D": r"\succeq", + "\u2280": r"\nprec", + "\u2281": r"\nsucc", + "\u22E0": r"\npreceq", + "\u22E1": r"\nsucceq", + + # Double-square brackets + "\u27E6": r"\llbracket", + "\u27E7": r"\rrbracket", + + # Card-suit decorative (drop) + "\u2660": "", # spade + "\u2661": "", + "\u2662": "", + "\u2663": "", # club + "\u2664": "", + "\u2665": "", # heart + "\u2666": "", # diamond + + # Musical / dingbat decorations (drop) + "\u266A": "", # eighth note + "\u266B": "", # beamed eighth notes + "\u2713": r"\checkmark", + "\u2717": r"\times", + + # Curved delimiters / bracket extension pieces -- these are used by the + # kernel generator to draw big parentheses/brackets around multi-line + # expressions (like matrices). They are purely decorative in plain text + # and we drop them. 
+ "\u239B": "", "\u239C": "", "\u239D": "", # ( upper/mid/lower + "\u239E": "", "\u239F": "", "\u23A0": "", # ) upper/mid/lower + "\u23A1": "", "\u23A2": "", "\u23A3": "", # [ upper/mid/lower + "\u23A4": "", "\u23A5": "", "\u23A6": "", # ] upper/mid/lower + "\u23A7": "", "\u23A8": "", "\u23A9": "", # { upper/middle/lower + "\u23AA": "", # { extension + "\u23AB": "", "\u23AC": "", "\u23AD": "", # } upper/middle/lower + "\u23AE": "", # integral extension + "\u23AF": "", # horizontal line extension + "\u23B0": "", "\u23B1": "", # upper/lower curly bracket + "\u23B2": "", "\u23B3": "", # summation top/bottom + "\u23B4": "", "\u23B5": "", # top/bottom square bracket + "\u23B6": "", "\u23B7": "", # bottom square bracket w/tick + "\u23D0": "", # vertical line extension + + # Combining over/underlines are stripped by the combining-mark regex + + # Additional remaining symbols found after first clean pass + "\u00AD": "", # SOFT HYPHEN -> delete + "\u2215": "/", # DIVISION SLASH + "\u25A2": r"\square", # WHITE SQUARE WITH ROUNDED CORNERS + "\u2718": r"\times", # HEAVY BALLOT X + "\u3008": r"\langle", # CJK LEFT ANGLE BRACKET + "\u3009": r"\rangle", # CJK RIGHT ANGLE BRACKET + "\u2254": ":=", # COLON EQUALS + "\u2255": "=:", # EQUALS COLON + "\u2198": r"\searrow", # SOUTH EAST ARROW + "\u2197": r"\nearrow", # NORTH EAST ARROW + "\u2199": r"\swarrow", + "\u2196": r"\nwarrow", + "\u21A9": r"\hookleftarrow", + "\u21AA": r"\hookrightarrow", + "\u21BC": r"\leftharpoonup", + "\u21BD": r"\leftharpoondown", + "\u21BE": r"\upharpoonright", + "\u21BF": r"\upharpoonleft", + "\u21C0": r"\rightharpoonup", + "\u21C1": r"\rightharpoondown", + "\u21C2": r"\downharpoonright", + "\u21C3": r"\downharpoonleft", + "\u21CC": r"\rightleftharpoons", + "\u21E2": r"\dashrightarrow", + "\u21E0": r"\dashleftarrow", + "\u2277": r"\gtrless", + "\u2276": r"\lessgtr", + + # Private Use Area characters are almost always OCR garbage or + # font-specific glyphs; drop them. 
+ "\uF8EB": "", "\uF8F6": "", + "\uF8FE": "", "\uF8FD": "", "\uF8FC": "", "\uF8FB": "", + "\uF8EF": "", "\uF8F0": "", "\uF8F1": "", "\uF8F2": "", + + # A few more rare but meaningful math symbols + "\u2322": r"\frown", + "\u2323": r"\smile", + "\u226D": r"\not\asymp", + "\u22A7": r"\models", + "\u22B2": r"\vartriangleleft", + "\u22B3": r"\vartriangleright", + "\u22B4": r"\trianglelefteq", + "\u22B5": r"\trianglerighteq", + + # Small-caps letters sometimes emitted by OCR (collapse to plain letter) + "\u026A": "I", # LATIN LETTER SMALL CAPITAL I + "\u1D00": "A", + "\u1D04": "C", + "\u1D05": "D", + "\u1D07": "E", + "\u0262": "G", + "\u029C": "H", + + # Remaining math symbols found after pass 2 + "\u2A01": r"\bigoplus", + "\u2A02": r"\bigotimes", + "\u2A00": r"\bigodot", + "\u2A03": r"\biguplus", + "\u2A04": r"\biguplus", + "\u2A05": r"\bigsqcap", + "\u2A06": r"\bigsqcup", + "\u2272": r"\lesssim", + "\u2273": r"\gtrsim", + "\u226E": r"\not<", + "\u226F": r"\not>", + "\u27EE": "(", # MATHEMATICAL LEFT FLATTENED PARENTHESIS + "\u27EF": ")", # MATHEMATICAL RIGHT FLATTENED PARENTHESIS + "\u2610": r"\square", # BALLOT BOX + "\u2611": r"\checkmark", + "\u2612": r"\times", + + # Root sentinels (wrapped in a later pass) + "\u221A": SENT_SQRT, + "\u221B": SENT_CBRT, + "\u221C": SENT_FRT, +} + + +_COMBINING_MARK_RE = re.compile( + r"[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]") +_BOX_DRAWING_RE = re.compile(r"[\u2500-\u257F\u2580-\u259F]") + +# Characters from scripts that have no place in English/Greek mathematics +# and are clearly OCR noise when they appear. Drop them wholesale. Latin and +# Greek are preserved; extended Latin letters with diacritics are still +# handled by the NFKD fallback. 
+_OCR_NOISE_SCRIPTS_RE = re.compile(
+    r"[\u0400-\u04FF"  # Cyrillic
+    r"\u0500-\u052F"   # Cyrillic Supplement
+    r"\u0530-\u058F"   # Armenian
+    r"\u0590-\u05FF"   # Hebrew
+    r"\u0600-\u06FF"   # Arabic
+    r"\u0700-\u074F"   # Syriac
+    r"\u0750-\u077F"   # Arabic Supplement
+    r"\u0780-\u07BF"   # Thaana
+    r"\u0900-\u097F"   # Devanagari
+    r"\u0B80-\u0BFF"   # Tamil
+    r"\u0C00-\u0C7F"   # Telugu
+    r"\u0C80-\u0CFF"   # Kannada
+    r"\u0D00-\u0D7F"   # Malayalam
+    r"\u0D80-\u0DFF"   # Sinhala
+    r"\u0E00-\u0E7F"   # Thai
+    r"\u0E80-\u0EFF"   # Lao
+    r"\u0F00-\u0FFF"   # Tibetan
+    r"\u1000-\u109F"   # Myanmar
+    r"\u10A0-\u10FF"   # Georgian
+    r"\u1100-\u11FF"   # Hangul Jamo
+    r"\u1400-\u167F"   # Unified Canadian Aboriginal Syllabics
+    r"\u1680-\u169F"   # Ogham
+    r"\u16A0-\u16FF"   # Runic
+    r"\u1700-\u171F"   # Tagalog
+    r"\u1780-\u17FF"   # Khmer
+    r"\u1800-\u18AF"   # Mongolian
+    r"\u1900-\u194F"   # Limbu
+    r"\u3040-\u309F"   # Hiragana
+    r"\u30A0-\u30FF"   # Katakana
+    r"\u3001-\u3007"   # CJK Symbols and Punctuation (incl. ideographic full
+    r"\u300A-\u303F"   # stop). \u3000 and \u3008/\u3009 are deliberately
+                       # excluded: REPLACEMENTS maps them (space / angle
+                       # brackets), and prestrip runs before char_substitute,
+                       # so including them here would delete them before the
+                       # substitution pass could rewrite them.
+    r"\u3100-\u312F"   # Bopomofo
+    r"\u3130-\u318F"   # Hangul Compatibility Jamo
+    r"\u3190-\u319F"   # Kanbun
+    r"\u3400-\u4DBF"   # CJK Extension A
+    r"\u4E00-\u9FFF"   # CJK Unified Ideographs
+    r"\uA000-\uA48F"   # Yi Syllables
+    r"\uAC00-\uD7AF"   # Hangul Syllables
+    r"\uE000-\uF8FF"   # Private Use Area
+    r"\uFE00-\uFE0F"   # Variation Selectors
+    r"\uFE30-\uFE4F"   # CJK Compatibility Forms (vertical presentation
+                       # brackets that NFKD-decompose to literal { } [ ] etc.,
+                       # which would corrupt our brace balance — drop them)
+    r"\uFE50-\uFE6F"   # Small Form Variants (compatibility forms)
+    r"\uFFFC\uFFFD"    # Object/Replacement Character
+    r"]"
+)
+
+# Emoji and pictographs (outside the BMP, need surrogate handling)
+_EMOJI_RE = re.compile(
+    "["
+    "\U0001F000-\U0001F9FF"  # Emoji blocks
+    "\U0001FA00-\U0001FAFF"  # Symbols & Pictographs Extended-A
+    "\U0001F1E6-\U0001F1FF"  # Regional indicator symbols
+    "\U0001F3FB-\U0001F3FF"  # Emoji modifier fitzpatrick
+    "\U00020000-\U0002FA1F"  # CJK Extensions B-F
+    "]",
+    flags=re.UNICODE
+)
+
+
+def prestrip(text: str) -> str:
+    """Strip decorative and OCR-noise characters BEFORE char substitution.
+
+    Important: we do NOT run NFKD here because NFKD decomposes subscript /
+    superscript digits (e.g. \u2080 -> '0') before our explicit REPLACEMENTS
+    entries can rewrite them as `_0`. NFKD is applied later only as a
+    fallback for characters that survive the explicit substitution pass
+    (e.g. accented Latin letters).
+    """
+    if not text:
+        return text
+    text = _BOX_DRAWING_RE.sub("", text)
+    # Lone combining marks are orphaned when the base character was something
+    # we otherwise transformed; strip them up front.
+    text = _COMBINING_MARK_RE.sub("", text)
+    # Strip OCR-noise scripts (Cyrillic / Arabic / CJK / etc.) that have no
+    # place in English-Greek mathematical prose.
+    text = _OCR_NOISE_SCRIPTS_RE.sub("", text)
+    # Strip emoji / pictographs (clearly LLM-emitted noise in math text).
+    text = _EMOJI_RE.sub("", text)
+    return text
+
+
+def char_substitute(text: str, unmapped: Counter) -> str:
+    """Apply REPLACEMENTS char-by-char. Any char not in REPLACEMENTS is left
+    in place so that _nfkd_fallback (run next) has a chance to handle it
+    via compatibility decomposition. A trailing space is appended to bare
+    `\\word` LaTeX commands so subsequent letters do not get absorbed into
+    the command name.
+    """
+    out = []
+    for ch in text:
+        if ord(ch) <= 127 or ch == "\x01":
+            out.append(ch)
+            continue
+        if ch in REPLACEMENTS:
+            val = REPLACEMENTS[ch]
+            # Bare `\word` (starts with `\\`, ends in a letter) needs a
+            # trailing space so that `\cdot t` does not become `\cdott`.
+            if (len(val) >= 2 and val[0] == "\\"
+                    and val[-1].isalpha()
+                    and not val.startswith("\x01")):
+                val = val + " "
+            out.append(val)
+            continue
+        # Unmapped: keep as-is and let _nfkd_fallback try compat decomposition.
+        out.append(ch)
+    return "".join(out)
+
+
+def _merge_sub_sup(text: str) -> str:
+    """Collapse runs of single-character scripts, e.g. `^1^2` -> `^{12}`
+    and `_a_b` -> `_{ab}`."""
+    def _do(prefix, m):
+        # Drop the ^/_ markers and concatenate the remaining argument
+        # characters (digits, letters, signs, parentheses).
+        pieces = [p for p in re.split(r"[\^_]", m.group(0)) if p]
+        joined = "".join(pieces)
+        return f"{prefix}{{{joined}}}"
+
+    text = re.sub(
+        r"(?:\^[\+\-\=\(\)a-zA-Z0-9])(?:\^[\+\-\=\(\)a-zA-Z0-9])+",
+        lambda m: _do("^", m), text)
+    text = re.sub(
+        r"(?:_[\+\-\=\(\)a-zA-Z0-9])(?:_[\+\-\=\(\)a-zA-Z0-9])+",
+        lambda m: _do("_", m), text)
+    return text
+
+
+_SENTINEL_RE = re.compile(r"\x01(SQRT|CBRT|FRT)\x01")
+
+
+def _skip_spaces(s: str, i: int) -> int:
+    while i < len(s) and s[i] in " \t":
+        i += 1
+    return i
+
+
+def _read_balanced(s: str, i: int, open_ch: str, close_ch: str):
+    depth = 0
+    j = i
+    while j < len(s):
+        if s[j] == open_ch:
+            depth += 1
+        elif s[j] == close_ch:
+            depth -= 1
+            if depth == 0:
+                return j + 1
+        j += 1
+    return -1
+
+
+def _read_latex_command(s: str, i: int):
+    if i >= len(s) or s[i] != "\\":
+        return -1
+    j = i + 1
+    while j < len(s) and (s[j].isalpha() or s[j] == "@"):
+        j += 1
+    while j < len(s) and s[j] == "{":
+        end = _read_balanced(s, j, "{", "}")
+        if end == -1:
+            return j
+        j = end
+    return j
+
+
+def _wrap_radical_arguments(text: str) -> str:
+    out = []
+    i = 0
+    LATEX_FOR = {"SQRT": r"\sqrt", "CBRT": r"\sqrt[3]", "FRT": r"\sqrt[4]"}
+    while i < len(text):
+        m = _SENTINEL_RE.match(text, i)
+        if not m:
+            out.append(text[i])
+            i += 1
+            continue
+        kind = m.group(1)
+        latex_prefix = LATEX_FOR[kind]
+        j = _skip_spaces(text, m.end())
+        if j >= len(text):
+            out.append(latex_prefix + "{}")
+            i = j
+            continue
+        ch = text[j]
+        if ch == "(":
+            arg_end = _read_balanced(text, j, "(", ")")
+            if arg_end != -1:
+                arg = text[j + 1 : arg_end - 1]
+                out.append(f"{latex_prefix}{{{arg}}}")
+                i = arg_end
+                continue
+        if ch == "[":
+            arg_end = _read_balanced(text, j, "[", "]")
+            if arg_end != -1:
+                arg = text[j + 1 : arg_end - 1]
+                out.append(f"{latex_prefix}{{{arg}}}")
+                i = arg_end
+                continue
+        if ch == "{":
+            arg_end = _read_balanced(text, j, "{", "}")
+            if arg_end != -1:
+                arg = text[j + 1 : arg_end - 1]
+                out.append(f"{latex_prefix}{{{arg}}}")
+                i = arg_end
+                continue
+        if ch == "\\":
+            arg_end = _read_latex_command(text, j)
+            if arg_end != -1:
+                arg = text[j:arg_end]
+                out.append(f"{latex_prefix}{{{arg}}}")
+                i = arg_end
+                continue
+        # Fallback: alnum run (and dots for things like 3.14)
+        k = j
+        while k < len(text) and (text[k].isalnum() or text[k] in "."):
+            k += 1
+        if k > j:
+            arg = text[j:k]
+            out.append(f"{latex_prefix}{{{arg}}}")
+            i = k
+            continue
+        out.append(latex_prefix + "{}")
+        i = m.end()
+    return "".join(out)
+
+
+def _nfkd_fallback(text: str, unmapped: Counter) -> str:
+    """For characters that survived explicit substitution and are still
+    non-ASCII (e.g. precomposed accented Latin letters like \u00E9 / e-acute,
+    or classical Greek letters with breathing marks like \u1F42), run NFKD
+    and drop combining marks, then re-apply REPLACEMENTS (because NFKD can
+    unmask characters that do appear in REPLACEMENTS, e.g. \u1F42 -> \u03BF).
+    Finally, any character that is still non-ASCII is logged and dropped.
+    """
+    has_non_ascii = any(ord(c) > 127 and c != "\x01" for c in text)
+    if not has_non_ascii:
+        return text
+    text = unicodedata.normalize("NFKD", text)
+    text = _COMBINING_MARK_RE.sub("", text)
+    # Second pass of char_substitute now that NFKD has possibly surfaced
+    # characters that were previously embedded in precomposed forms.
+ text = char_substitute(text, unmapped) # unmapped counter accumulates + # Final drop of anything still non-ASCII + out = [] + for c in text: + if ord(c) <= 127 or c == "\x01": + out.append(c) + else: + unmapped[c] += 1 + return "".join(out) + + +def clean_text(text: str, unmapped: Counter) -> str: + if not text: + return text + text = prestrip(text) + text = char_substitute(text, unmapped) + text = _nfkd_fallback(text, unmapped) + text = _merge_sub_sup(text) + text = _wrap_radical_arguments(text) + return text + + +def clean_problem(problem: dict, unmapped: Counter): + for k in TOP_LEVEL_TEXT_FIELDS: + if isinstance(problem.get(k), str): + problem[k] = clean_text(problem[k], unmapped) + variants = problem.get("variants") or {} + for vk in VARIANT_KEYS: + vd = variants.get(vk) + if not isinstance(vd, dict): + continue + for k in VARIANT_TEXT_FIELDS: + if isinstance(vd.get(k), str): + vd[k] = clean_text(vd[k], unmapped) + return problem + + +def process_dir(dataset_dir: Path): + print(f"\n=== Cleaning {dataset_dir} ===") + files = sorted(dataset_dir.glob("*.json")) + unmapped = Counter() + n_modified = 0 + for f in files: + try: + d = json.load(open(f)) + except Exception as e: + print(f" ! 
skip {f.name}: {e}") + continue + before = json.dumps(d, ensure_ascii=False) + d = clean_problem(d, unmapped) + after = json.dumps(d, ensure_ascii=False) + if before != after: + n_modified += 1 + with open(f, "w") as fh: + json.dump(d, fh, ensure_ascii=False, indent=2) + print(f" files modified: {n_modified}/{len(files)}") + if unmapped: + print(f" unmapped characters: {sum(unmapped.values())} occurrences, " + f"{len(unmapped)} distinct") + print(f" top 20 unmapped:") + for ch, n in unmapped.most_common(20): + name = unicodedata.name(ch, "?") + print(f" {ch!r:<10} U+{ord(ch):04X} n={n} ({name})") + else: + print(f" no unmapped characters") + return unmapped + + +def main(): + all_unmapped = Counter() + for d in DIRS: + if d.exists(): + u = process_dir(d) + all_unmapped.update(u) + print(f"\n=== OVERALL ===") + print(f"Total unmapped characters across both dataset copies: {sum(all_unmapped.values())}") + print(f"Distinct unmapped: {len(all_unmapped)}") + if all_unmapped: + out_path = Path("/home/yurenh2/gap/analysis/unmapped_chars.json") + json.dump({f"U+{ord(c):04X}": {"char": c, "name": unicodedata.name(c, "?"), + "count": n} + for c, n in all_unmapped.most_common()}, + open(out_path, "w"), indent=2, ensure_ascii=False) + print(f"Saved unmapped list -> {out_path}") + + +if __name__ == "__main__": + main() diff --git a/kv_math_200.py b/kv_math_200.py new file mode 100644 index 0000000..0400bf2 --- /dev/null +++ b/kv_math_200.py @@ -0,0 +1,377 @@ +#!/usr/bin/env python3 +""" +KV on MATH: Generate Kernel Variants for 200 MATH Level 5 problems. +Async parallel with repair loop and 3 judges. Resumes from previous run. 
+""" + +import json, asyncio, random, re, os, sys, time +from datasets import load_dataset +from openai import AsyncOpenAI + +client = AsyncOpenAI() +SEM_O3 = asyncio.Semaphore(3) # o3 calls - conservative +SEM_GPT4O = asyncio.Semaphore(20) # gpt-4o calls +SEM_EVAL = asyncio.Semaphore(40) # evaluation calls +random.seed(42) + +OUTPUT_DIR = '/home/yurenh2/gap/mini_gap_math_results/kv_200' +os.makedirs(OUTPUT_DIR, exist_ok=True) +PROGRESS_FILE = os.path.join(OUTPUT_DIR, 'kv_generation.json') +LOCK = asyncio.Lock() + +# ============================================================ +# Prompts +# ============================================================ + +SLOT_DISCOVERY = """You are a mathematical analysis expert. Given a math problem and its solution, identify all "mutable slots" — numerical constants, parameters, coefficients, or specific values that could be changed to create a new but structurally equivalent problem. + +For each slot provide: the original value, what it represents, and constraints on alternatives. + +Return ONLY valid JSON: +{"mutable_slots": [{"value": "...", "role": "...", "constraints": "..."}, ...]} +If no mutable slots exist, return: {"mutable_slots": []}""" + +BACK_SYNTHESIS = """You are creating a mathematical variant problem. Given an original problem, its solution, and mutable slots: +- Choose NEW values for each slot satisfying constraints +- Rewrite the problem with new values +- Work out the complete new solution step by step +- The new problem MUST be solvable following the same reasoning + +Return ONLY valid JSON: +{"new_problem": "...", "new_solution": "...", "new_answer": "...", "slot_changes": [{"original": "...", "new": "..."}]}""" + +VERIFY = """You are a rigorous mathematical verifier. Given a problem and solution: +1. Is the problem well-defined and solvable? +2. Is every step mathematically correct? +3. Does the solution correctly arrive at the stated answer? 
+ +Reply with EXACTLY: +VERDICT: ACCEPT +or +VERDICT: REJECT +REASON: [what is wrong]""" + +REPAIR = """The following mathematical variant was rejected. Fix it. +Problem: {problem} +Solution: {solution} +Rejection reason: {reason} +Return ONLY valid JSON: +{{"new_problem": "...", "new_solution": "...", "new_answer": "..."}}""" + +# ============================================================ +# API Helpers +# ============================================================ + +def extract_json(text): + if not text: + return None + try: + return json.loads(text) + except: + pass + match = re.search(r'```(?:json)?\s*(\{[\s\S]*?\})\s*```', text) + if match: + try: + return json.loads(match.group(1)) + except: + pass + match = re.search(r'\{[\s\S]*\}', text) + if match: + try: + return json.loads(match.group()) + except: + pass + return None + +async def call_api(messages, model="gpt-4o", max_tokens=4000): + sem = SEM_O3 if model == "o3" else SEM_GPT4O + async with sem: + for attempt in range(5): + try: + kwargs = {"model": model, "messages": messages} + if model == "o3": + kwargs["max_completion_tokens"] = max_tokens + else: + kwargs["max_tokens"] = max_tokens + kwargs["temperature"] = 0 + resp = await client.chat.completions.create(**kwargs) + return resp.choices[0].message.content + except Exception as e: + wait = min(60, (2 ** attempt) * 3) + if attempt < 4: + await asyncio.sleep(wait) + else: + return None + +async def save_result(result, all_results): + async with LOCK: + all_results.append(result) + with open(PROGRESS_FILE, 'w') as f: + json.dump(all_results, f) + +# ============================================================ +# KV Pipeline +# ============================================================ + +async def generate_kv(problem, solution, idx, all_results, max_repairs=3, n_judges=3): + # Stage 1: Slot Discovery (gpt-4o for speed) + slot_text = await call_api( + [{"role": "system", "content": SLOT_DISCOVERY}, + {"role": "user", "content": 
f"Problem:\n{problem}\n\nSolution:\n{solution}"}], + model="gpt-4o", max_tokens=2000 + ) + slots_data = extract_json(slot_text) if slot_text else None + if not slots_data or not slots_data.get('mutable_slots'): + result = {'status': 'no_slots', 'original_index': idx, 'reason': 'no mutable slots'} + await save_result(result, all_results) + print(f"[{idx}] no_slots") + return + + n_slots = len(slots_data['mutable_slots']) + + # Stage 2: Back-synthesis (o3 for quality) + synth_text = await call_api( + [{"role": "system", "content": BACK_SYNTHESIS}, + {"role": "user", "content": f"Original problem:\n{problem}\n\nOriginal solution:\n{solution}\n\nMutable slots:\n{json.dumps(slots_data['mutable_slots'])}\n\nCreate a variant."}], + model="o3", max_tokens=6000 + ) + synth_data = extract_json(synth_text) if synth_text else None + if not synth_data or not synth_data.get('new_problem'): + result = {'status': 'error', 'original_index': idx, 'reason': 'synthesis failed'} + await save_result(result, all_results) + print(f"[{idx}] synthesis_error") + return + + new_problem = synth_data['new_problem'] + new_solution = synth_data['new_solution'] + new_answer = synth_data.get('new_answer', '') + + # Stage 3: Verify with repair loop + for repair_round in range(max_repairs + 1): + # Run judges in parallel + judge_tasks = [] + for _ in range(n_judges): + judge_tasks.append(call_api( + [{"role": "system", "content": VERIFY}, + {"role": "user", "content": f"Problem:\n{new_problem}\n\nSolution:\n{new_solution}"}], + model="o3", max_tokens=500 + )) + judge_results = await asyncio.gather(*judge_tasks) + + accepts = 0 + reasons = [] + for jr in judge_results: + if jr and 'ACCEPT' in jr.upper() and 'REJECT' not in jr.upper(): + accepts += 1 + else: + match = re.search(r'REASON:\s*(.*)', jr or '', re.IGNORECASE) + reasons.append(match.group(1).strip() if match else (jr or 'unknown')[:200]) + + if accepts == n_judges: + result = { + 'status': 'accepted', + 'original_index': idx, + 
'original_problem': problem, + 'original_solution': solution, + 'mutable_slots': slots_data['mutable_slots'], + 'kv_problem': new_problem, + 'kv_solution': new_solution, + 'kv_answer': new_answer, + 'slot_changes': synth_data.get('slot_changes', []), + 'repair_rounds': repair_round, + 'n_slots': n_slots, + } + await save_result(result, all_results) + print(f"[{idx}] ACCEPTED (round {repair_round}, {n_slots} slots)") + return + + if repair_round < max_repairs: + reason_str = '; '.join(reasons[:2])[:500] + repair_text = await call_api( + [{"role": "system", "content": REPAIR.format(problem=new_problem, solution=new_solution, reason=reason_str)}, + {"role": "user", "content": "Fix the variant."}], + model="o3", max_tokens=6000 + ) + repair_data = extract_json(repair_text) if repair_text else None + if repair_data: + new_problem = repair_data.get('new_problem', new_problem) + new_solution = repair_data.get('new_solution', new_solution) + new_answer = repair_data.get('new_answer', new_answer) + + result = {'status': 'rejected', 'original_index': idx, 'reason': f'failed {max_repairs} repairs'} + await save_result(result, all_results) + print(f"[{idx}] REJECTED") + +# ============================================================ +# Evaluation +# ============================================================ + +def extract_boxed(text): + if not text: + return None + matches = [] + i = 0 + while i < len(text): + idx = text.find('\\boxed{', i) + if idx == -1: + break + depth = 1; j = idx + 7 + while j < len(text) and depth > 0: + if text[j] == '{': depth += 1 + elif text[j] == '}': depth -= 1 + j += 1 + if depth == 0: + matches.append(text[idx+7:j-1].strip()) + i = j + return matches[-1] if matches else None + +async def evaluate_all(accepted_results): + if not accepted_results: + return {} + + async def solve(problem, model): + async with SEM_EVAL: + resp = await client.chat.completions.create( + model=model, temperature=0, max_tokens=2048, + messages=[ + {"role": "system", 
"content": "Solve step by step. Put final answer in \\boxed{}."}, + {"role": "user", "content": problem} + ], + ) + return resp.choices[0].message.content + + async def grade(ref_answer, student_answer): + async with SEM_EVAL: + resp = await client.chat.completions.create( + model="gpt-4o", temperature=0, max_tokens=10, + messages=[{"role": "user", "content": f"Are these mathematical answers equivalent? Reference: {ref_answer}\nStudent: {student_answer}\nReply CORRECT or INCORRECT."}], + ) + text = resp.choices[0].message.content.upper() + return 'INCORRECT' not in text and 'CORRECT' in text + + eval_models = ['gpt-4o', 'gpt-4o-mini'] + results = {} + + for model in eval_models: + print(f"\nEvaluating {len(accepted_results)} variants with {model}...") + + # Solve originals + orig_tasks = [solve(r['original_problem'], model) for r in accepted_results] + orig_sols = await asyncio.gather(*orig_tasks) + + # Solve KVs + kv_tasks = [solve(r['kv_problem'], model) for r in accepted_results] + kv_sols = await asyncio.gather(*kv_tasks) + + # Grade + orig_grades = [] + kv_grades = [] + for i, r in enumerate(accepted_results): + ref_orig = extract_boxed(r['original_solution']) + stu_orig = extract_boxed(orig_sols[i]) + ref_kv = r.get('kv_answer') or extract_boxed(r.get('kv_solution', '')) + stu_kv = extract_boxed(kv_sols[i]) + + og = await grade(ref_orig or 'N/A', stu_orig or 'N/A') if ref_orig and stu_orig else False + kg = await grade(ref_kv or 'N/A', stu_kv or 'N/A') if ref_kv and stu_kv else False + orig_grades.append(og) + kv_grades.append(kg) + + orig_acc = sum(orig_grades) / len(orig_grades) * 100 + kv_acc = sum(kv_grades) / len(kv_grades) * 100 + + results[model] = { + 'original_accuracy': orig_acc, + 'kv_accuracy': kv_acc, + 'delta': kv_acc - orig_acc, + 'n': len(accepted_results), + 'orig_correct': sum(orig_grades), + 'kv_correct': sum(kv_grades), + } + print(f" {model}: orig={orig_acc:.1f}%, kv={kv_acc:.1f}%, Δ={kv_acc-orig_acc:+.1f}pp (n={len(accepted_results)})") 
+ + return results + +# ============================================================ +# Main +# ============================================================ + +async def main(): + # Load all Level 5 problems + subsets = ['algebra', 'number_theory', 'precalculus', 'intermediate_algebra', 'counting_and_probability', 'geometry'] + all_level5 = [] + for subset in subsets: + ds = load_dataset('EleutherAI/hendrycks_math', subset, split='test') + for item in ds: + if item.get('level') == 'Level 5' and len(item.get('solution', '')) > 50: + item['subject'] = subset + all_level5.append(item) + + random.shuffle(all_level5) + selected = all_level5[:200] + print(f"Selected {len(selected)} Level 5 problems") + + # Load previous results + prev_file = '/home/yurenh2/gap/mini_gap_math_results/kv_50/kv_generation.json' + prev_accepted = [] + if os.path.exists(prev_file): + with open(prev_file) as f: + prev_data = json.load(f) + prev_accepted = [r for r in prev_data if r['status'] == 'accepted'] + print(f"Loaded {len(prev_accepted)} previously accepted variants") + + # Load current progress + all_results = [] + done_indices = set() + if os.path.exists(PROGRESS_FILE): + with open(PROGRESS_FILE) as f: + all_results = json.load(f) + done_indices = {r['original_index'] for r in all_results} + print(f"Resuming: {len(all_results)} already processed") + + # Generate remaining + remaining = [(i, p) for i, p in enumerate(selected) if i not in done_indices] + print(f"Remaining: {len(remaining)} problems") + + # Process in batches of 10 for controlled parallelism + BATCH_SIZE = 10 + for batch_start in range(0, len(remaining), BATCH_SIZE): + batch = remaining[batch_start:batch_start + BATCH_SIZE] + tasks = [generate_kv(p['problem'], p['solution'], i, all_results) for i, p in batch] + await asyncio.gather(*tasks) + + # Status update + from collections import Counter + status = Counter(r['status'] for r in all_results) + accepted_count = status.get('accepted', 0) + print(f"\n--- Progress: 
{len(all_results)}/200, accepted={accepted_count}, status={dict(status)} ---\n") + + # Combine with previous accepted + new_accepted = [r for r in all_results if r['status'] == 'accepted'] + all_accepted = prev_accepted + new_accepted + print(f"\nTotal accepted: {len(all_accepted)} ({len(prev_accepted)} prev + {len(new_accepted)} new)") + + # Evaluate + if all_accepted: + print(f"\nEvaluating {len(all_accepted)} accepted KV variants...") + eval_results = await evaluate_all(all_accepted) + + final = { + 'generation_summary': { + 'total_attempted': 200 + 50, + 'new_accepted': len(new_accepted), + 'prev_accepted': len(prev_accepted), + 'total_accepted': len(all_accepted), + }, + 'evaluation': eval_results, + } + with open(os.path.join(OUTPUT_DIR, 'kv_final_results.json'), 'w') as f: + json.dump(final, f, indent=2) + + print(f"\n{'='*60}") + print(f"FINAL KV RESULTS ({len(all_accepted)} variants)") + print(f"{'='*60}") + for model, res in eval_results.items(): + print(f" {model}: orig={res['original_accuracy']:.1f}%, kv={res['kv_accuracy']:.1f}%, Δ={res['delta']:+.1f}pp") + +asyncio.run(main()) diff --git a/kv_math_50.py b/kv_math_50.py new file mode 100644 index 0000000..f9edfcd --- /dev/null +++ b/kv_math_50.py @@ -0,0 +1,425 @@ +#!/usr/bin/env python3 +""" +KV on MATH: Generate Kernel Variants for 50 MATH Level 5 problems. +3-stage pipeline with repair loop and 3 judges. +""" + +import json, asyncio, random, re, os, sys, time +from openai import AsyncOpenAI + +client = AsyncOpenAI() +SEM_GEN = asyncio.Semaphore(2) # generation calls (expensive o3) - low to avoid rate limits +SEM_EVAL = asyncio.Semaphore(30) # evaluation calls (cheaper) +random.seed(42) + +OUTPUT_DIR = '/home/yurenh2/gap/mini_gap_math_results/kv_50' +os.makedirs(OUTPUT_DIR, exist_ok=True) + +# ============================================================ +# Prompts +# ============================================================ + +SLOT_DISCOVERY = """You are a mathematical analysis expert. 
Given a math problem and its solution, identify all "mutable slots" — numerical constants, parameters, coefficients, or specific values that could be changed to create a new but structurally equivalent problem. + +For each slot: +1. The original value (as it appears in the problem) +2. What it represents mathematically +3. Constraints on alternative values (what must hold for the same solution technique to work) + +Return ONLY valid JSON: +{ + "mutable_slots": [ + {"value": "...", "role": "...", "constraints": "..."}, + ... + ] +} + +If the problem has no mutable slots (all constants are mathematically fixed), return: +{"mutable_slots": []}""" + +BACK_SYNTHESIS = """You are creating a mathematical variant problem. Given: +1. An original problem and solution +2. Mutable slots identified in the problem + +Your task: +- Choose NEW values for each mutable slot that satisfy the constraints +- Rewrite the problem with these new values +- Work out the complete new solution step by step +- The new problem MUST be solvable and the solution MUST follow the same mathematical reasoning + +Return ONLY valid JSON: +{ + "new_problem": "... (full LaTeX problem statement)", + "new_solution": "... (complete step-by-step solution)", + "new_answer": "... (final answer that would go in \\boxed{})", + "slot_changes": [{"original": "...", "new": "..."}] +}""" + +VERIFY = """You are a rigorous mathematical verifier. Given a problem and its proposed solution: + +1. Is the problem well-defined and solvable? +2. Is every step in the solution mathematically correct? +3. Does the solution correctly arrive at the stated answer? + +If ANY step is wrong or the answer doesn't follow, you MUST reject. + +Reply with EXACTLY: +VERDICT: ACCEPT +or +VERDICT: REJECT +REASON: [what is wrong]""" + +REPAIR = """The following mathematical variant was rejected by a verifier. 
+ +Problem: {problem} +Solution: {solution} +Rejection reason: {reason} + +Please fix the solution (or the problem if needed) so that it is mathematically correct. +Keep the same problem structure and slot changes. + +Return ONLY valid JSON: +{{"new_problem": "...", "new_solution": "...", "new_answer": "..."}}""" + +# ============================================================ +# KV Generation Pipeline +# ============================================================ + +def extract_json(text): + """Extract JSON from text that may contain markdown or extra text.""" + # Try direct parse + try: + return json.loads(text) + except: + pass + # Try finding JSON block + match = re.search(r'```(?:json)?\s*(\{[\s\S]*?\})\s*```', text) + if match: + try: + return json.loads(match.group(1)) + except: + pass + # Try finding raw JSON + match = re.search(r'\{[\s\S]*\}', text) + if match: + try: + return json.loads(match.group()) + except: + pass + return None + + +async def call_llm(system_prompt, user_content, max_tokens=4000, model="gpt-4o"): + """Call LLM with rate limiting and exponential backoff.""" + sem = SEM_GEN if model == "o3" else SEM_EVAL + async with sem: + for attempt in range(4): + try: + kwargs = {"model": model, "messages": [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_content} + ]} + if model == "o3": + kwargs["max_completion_tokens"] = max_tokens + else: + kwargs["max_tokens"] = max_tokens + kwargs["temperature"] = 0 + resp = await client.chat.completions.create(**kwargs) + return resp.choices[0].message.content + except Exception as e: + wait = (2 ** attempt) * 3 + print(f" {model} error (attempt {attempt+1}): {e}, waiting {wait}s...") + await asyncio.sleep(wait) + print(f" {model} failed after 4 attempts") + return None + +async def call_o3(system_prompt, user_content, max_tokens=4000): + return await call_llm(system_prompt, user_content, max_tokens, model="o3") + + +async def verify_once(problem, solution): + """Single 
verification judge call.""" + async with SEM_GEN: + try: + resp = await client.chat.completions.create( + model="o3", max_completion_tokens=500, + messages=[ + {"role": "system", "content": VERIFY}, + {"role": "user", "content": f"Problem:\n{problem}\n\nSolution:\n{solution}"} + ], + ) + text = resp.choices[0].message.content + accepted = 'ACCEPT' in text.upper() and 'REJECT' not in text.upper() + reason = "" + if not accepted: + match = re.search(r'REASON:\s*(.*)', text, re.IGNORECASE) + reason = match.group(1).strip() if match else text + return accepted, reason + except Exception as e: + return False, str(e) + + +async def generate_kv_with_repair(problem, solution, idx, max_repairs=3, n_judges=3): + """Generate KV with repair loop and multiple judges.""" + print(f"\n[{idx}] Processing...") + + # Stage 1: Slot Discovery (use gpt-4o for speed) + slot_text = await call_llm(SLOT_DISCOVERY, f"Problem:\n{problem}\n\nSolution:\n{solution}", 2000, model="gpt-4o") + if not slot_text: + return {'status': 'error', 'reason': 'slot discovery failed'} + + slots_data = extract_json(slot_text) + if not slots_data or not slots_data.get('mutable_slots'): + print(f"[{idx}] No mutable slots found") + return {'status': 'no_slots', 'reason': 'no mutable slots identified'} + + n_slots = len(slots_data['mutable_slots']) + print(f"[{idx}] Found {n_slots} slots") + + # Stage 2: Back-synthesis (use o3 for quality) + synth_text = await call_llm( + BACK_SYNTHESIS, + f"Original problem:\n{problem}\n\nOriginal solution:\n{solution}\n\nMutable slots:\n{json.dumps(slots_data['mutable_slots'], indent=2)}\n\nCreate a variant with different values.", + 6000, model="o3" + ) + if not synth_text: + return {'status': 'error', 'reason': 'back-synthesis failed'} + + synth_data = extract_json(synth_text) + if not synth_data or not synth_data.get('new_problem'): + return {'status': 'error', 'reason': 'could not parse synthesized variant'} + + new_problem = synth_data['new_problem'] + new_solution = 
synth_data['new_solution'] + new_answer = synth_data.get('new_answer', '') + + # Stage 3: Verification with repair loop + for repair_round in range(max_repairs + 1): + # Run n_judges in parallel + judge_tasks = [verify_once(new_problem, new_solution) for _ in range(n_judges)] + verdicts = await asyncio.gather(*judge_tasks) + + accepts = sum(1 for v, _ in verdicts if v) + reasons = [r for v, r in verdicts if not v] + + if accepts == n_judges: + print(f"[{idx}] ACCEPTED (round {repair_round}, {n_judges}/{n_judges} judges)") + return { + 'status': 'accepted', + 'original_problem': problem, + 'original_solution': solution, + 'mutable_slots': slots_data['mutable_slots'], + 'kv_problem': new_problem, + 'kv_solution': new_solution, + 'kv_answer': new_answer, + 'slot_changes': synth_data.get('slot_changes', []), + 'repair_rounds': repair_round, + 'judge_count': n_judges, + } + + if repair_round < max_repairs: + print(f"[{idx}] Round {repair_round}: {accepts}/{n_judges} accepted, repairing...") + # Repair + reason_str = '; '.join(reasons[:2]) + repair_text = await call_llm( + REPAIR.format(problem=new_problem, solution=new_solution, reason=reason_str), + "Fix the variant.", 6000, model="o3" + ) + if repair_text: + repair_data = extract_json(repair_text) + if repair_data: + new_problem = repair_data.get('new_problem', new_problem) + new_solution = repair_data.get('new_solution', new_solution) + new_answer = repair_data.get('new_answer', new_answer) + + print(f"[{idx}] REJECTED after {max_repairs} repairs") + return { + 'status': 'rejected', + 'original_problem': problem, + 'reason': f'failed after {max_repairs} repair rounds', + 'last_reasons': reasons, + } + + +async def evaluate_kv(accepted_results, variants_data): + """Evaluate accepted KV variants with GPT-4o and GPT-4o-mini.""" + if not accepted_results: + return {} + + async def solve(problem, model): + async with SEM_EVAL: + resp = await client.chat.completions.create( + model=model, temperature=0, max_tokens=2048, + 
+                messages=[
+                    {"role": "system", "content": "Solve step by step. Put final answer in \\boxed{}."},
+                    {"role": "user", "content": problem}
+                ],
+            )
+            return resp.choices[0].message.content
+
+    async def grade(ref_answer, student_answer, model="gpt-4o"):
+        async with SEM_EVAL:
+            resp = await client.chat.completions.create(
+                model=model, temperature=0, max_tokens=10,
+                messages=[{"role": "user", "content": f"Are these two mathematical answers equivalent? Reference: {ref_answer}\nStudent: {student_answer}\nReply CORRECT or INCORRECT."}],
+            )
+            # Must exclude 'INCORRECT' first: it contains 'CORRECT' as a substring
+            text = resp.choices[0].message.content.upper()
+            return 'INCORRECT' not in text and 'CORRECT' in text
+
+    def extract_boxed(text):
+        if not text: return None
+        matches = []
+        i = 0
+        while i < len(text):
+            idx = text.find('\\boxed{', i)
+            if idx == -1: break
+            depth = 1; j = idx + 7
+            while j < len(text) and depth > 0:
+                if text[j] == '{': depth += 1
+                elif text[j] == '}': depth -= 1
+                j += 1
+            if depth == 0: matches.append(text[idx+7:j-1].strip())
+            i = j
+        return matches[-1] if matches else None
+
+    eval_models = ['gpt-4o', 'gpt-4o-mini']
+    results = {}
+
+    for model in eval_models:
+        print(f"\nEvaluating with {model}...")
+
+        # Solve originals
+        orig_tasks = [solve(r['original_problem'], model) for r in accepted_results]
+        orig_solutions = await asyncio.gather(*orig_tasks)
+
+        # Solve KV variants
+        kv_tasks = [solve(r['kv_problem'], model) for r in accepted_results]
+        kv_solutions = await asyncio.gather(*kv_tasks)
+
+        # Grade originals
+        orig_grades = []
+        for i, (sol, r) in enumerate(zip(orig_solutions, accepted_results)):
+            ref = extract_boxed(r['original_solution'])
+            stu = extract_boxed(sol)
+            if ref and stu:
+                g = await grade(ref, stu)
+                orig_grades.append(g)
+            else:
+                orig_grades.append(False)
+
+        # Grade KVs
+        kv_grades = []
+        for i, (sol, r) in enumerate(zip(kv_solutions, accepted_results)):
+            ref = r['kv_answer'] or extract_boxed(r['kv_solution'])
+            stu = extract_boxed(sol)
+            if ref and stu:
+                g = await grade(ref, stu)
+                kv_grades.append(g)
+            else:
kv_grades.append(False) + + orig_acc = sum(orig_grades) / len(orig_grades) * 100 + kv_acc = sum(kv_grades) / len(kv_grades) * 100 + delta = kv_acc - orig_acc + + results[model] = { + 'original_accuracy': orig_acc, + 'kv_accuracy': kv_acc, + 'delta': delta, + 'n': len(accepted_results), + 'orig_correct': sum(orig_grades), + 'kv_correct': sum(kv_grades), + } + print(f" {model}: orig={orig_acc:.1f}%, kv={kv_acc:.1f}%, Δ={delta:+.1f}pp") + + return results + + +async def main(): + # Load MATH Level 5 problems + with open('/home/yurenh2/gap/math_sample_200.json') as f: + all_problems = json.load(f) + + level5 = [p for p in all_problems if p['level'] == 'Level 5' and len(p['solution']) > 50] + selected = random.sample(level5, min(50, len(level5))) + print(f"Selected {len(selected)} Level 5 problems for KV generation") + + # Load any previous progress + progress_file = os.path.join(OUTPUT_DIR, 'kv_generation.json') + if os.path.exists(progress_file): + with open(progress_file) as f: + kv_results = json.load(f) + done_indices = {r['original_index'] for r in kv_results} + print(f"Resuming: {len(kv_results)} already done") + else: + kv_results = [] + done_indices = set() + + # Generate KVs sequentially to avoid rate limits + for i, p in enumerate(selected): + if i in done_indices: + continue + result = await generate_kv_with_repair( + p['problem'], p['solution'], i, + max_repairs=3, n_judges=3 + ) + result['original_index'] = i + result['subject'] = p.get('subject', 'unknown') + kv_results.append(result) + + # Save incrementally + with open(progress_file, 'w') as f: + json.dump(kv_results, f, indent=2) + + # Small delay between problems + await asyncio.sleep(2) + + # Summary + accepted = [r for r in kv_results if r['status'] == 'accepted'] + rejected = [r for r in kv_results if r['status'] == 'rejected'] + no_slots = [r for r in kv_results if r['status'] == 'no_slots'] + errors = [r for r in kv_results if r['status'] == 'error'] + + print(f"\n{'='*60}") + print(f"KV 
GENERATION SUMMARY") + print(f"{'='*60}") + print(f"Total attempted: {len(selected)}") + print(f"Accepted: {len(accepted)} ({len(accepted)/len(selected)*100:.0f}%)") + print(f"Rejected: {len(rejected)} ({len(rejected)/len(selected)*100:.0f}%)") + print(f"No slots: {len(no_slots)} ({len(no_slots)/len(selected)*100:.0f}%)") + print(f"Errors: {len(errors)} ({len(errors)/len(selected)*100:.0f}%)") + + if accepted: + avg_repairs = sum(r['repair_rounds'] for r in accepted) / len(accepted) + print(f"Avg repair rounds (accepted): {avg_repairs:.1f}") + + # Evaluate accepted variants + if accepted: + print(f"\n{'='*60}") + print(f"EVALUATING {len(accepted)} ACCEPTED KV VARIANTS") + print(f"{'='*60}") + eval_results = await evaluate_kv(accepted, None) + + # Save everything + final_results = { + 'generation_summary': { + 'total': len(selected), + 'accepted': len(accepted), + 'rejected': len(rejected), + 'no_slots': len(no_slots), + 'errors': len(errors), + }, + 'evaluation': eval_results, + 'accepted_variants': accepted, + } + with open(os.path.join(OUTPUT_DIR, 'kv_final_results.json'), 'w') as f: + json.dump(final_results, f, indent=2) + + print(f"\n{'='*60}") + print(f"FINAL RESULTS") + print(f"{'='*60}") + print(f"{'Model':<20} {'Orig%':>8} {'KV%':>8} {'Δ':>8} {'N':>5}") + print("-"*50) + for model, res in eval_results.items(): + print(f"{model:<20} {res['original_accuracy']:>7.1f}% {res['kv_accuracy']:>7.1f}% {res['delta']:>+7.1f} {res['n']:>5}") + +asyncio.run(main()) diff --git a/kv_math_redo.py b/kv_math_redo.py new file mode 100644 index 0000000..6ef7b63 --- /dev/null +++ b/kv_math_redo.py @@ -0,0 +1,275 @@ +#!/usr/bin/env python3 +""" +KV redo: Re-run slot discovery with o3 on no_slots problems, then evaluate all accepted. 
+""" +import json, asyncio, random, re, os +from datasets import load_dataset +from openai import AsyncOpenAI + +client = AsyncOpenAI() +SEM_O3 = asyncio.Semaphore(3) +SEM_EVAL = asyncio.Semaphore(40) +random.seed(42) + +OUTPUT_DIR = '/home/yurenh2/gap/mini_gap_math_results/kv_200' +REDO_FILE = os.path.join(OUTPUT_DIR, 'kv_redo.json') +LOCK = asyncio.Lock() + +# Prompts +SLOT_DISCOVERY_O3 = """You are a world-class mathematician. Given a math problem and its reference solution, find ALL numerical constants, coefficients, parameters, or specific values that could be changed to create a structurally equivalent but numerically different problem. + +Be AGGRESSIVE in finding slots. Even if a value seems "natural" (like 2, 3, etc.), if changing it to another value would still yield a solvable problem with the same solution technique, list it. + +Examples of mutable slots: +- Coefficients in equations (2x+3 → 5x+7) +- Exponents (x^3 → x^5) +- Bounds or limits (sum from 1 to 100 → sum from 1 to 200) +- Specific numbers in word problems +- Dimensions or sizes +- Modular bases + +Return ONLY valid JSON: +{"mutable_slots": [{"value": "...", "role": "...", "constraints": "..."}, ...]} +If truly no slots exist (every constant is mathematically forced), return: {"mutable_slots": []}""" + +BACK_SYNTHESIS = """You are creating a mathematical variant. Given the original problem, solution, and mutable slots: +- Choose NEW values satisfying constraints +- Rewrite the full problem with new values +- Solve it completely step by step +- The solution MUST use the same mathematical technique + +Return ONLY valid JSON: +{"new_problem": "...", "new_solution": "...", "new_answer": "...", "slot_changes": [{"original": "...", "new": "..."}]}""" + +VERIFY = """You are a rigorous mathematical verifier. Check: +1. Is the problem well-defined? +2. Is every solution step correct? +3. Does it reach the stated answer? 
+ +Reply EXACTLY: VERDICT: ACCEPT or VERDICT: REJECT +REASON: [explanation]""" + +REPAIR = """Fix this rejected variant. +Problem: {problem} +Solution: {solution} +Reason: {reason} +Return ONLY JSON: {{"new_problem": "...", "new_solution": "...", "new_answer": "..."}}""" + +def extract_json(text): + if not text: return None + try: return json.loads(text) + except: pass + m = re.search(r'```(?:json)?\s*(\{[\s\S]*?\})\s*```', text) + if m: + try: return json.loads(m.group(1)) + except: pass + m = re.search(r'\{[\s\S]*\}', text) + if m: + try: return json.loads(m.group()) + except: pass + return None + +async def api_call(messages, model="o3", max_tokens=4000): + sem = SEM_O3 if model == "o3" else SEM_EVAL + async with sem: + for attempt in range(5): + try: + kw = {"model": model, "messages": messages} + if model == "o3": kw["max_completion_tokens"] = max_tokens + else: kw["max_tokens"] = max_tokens; kw["temperature"] = 0 + r = await client.chat.completions.create(**kw) + return r.choices[0].message.content + except Exception as e: + w = min(60, (2**attempt)*3) + if attempt < 4: await asyncio.sleep(w) + else: return None + +async def save(result, results_list): + async with LOCK: + results_list.append(result) + with open(REDO_FILE, 'w') as f: + json.dump(results_list, f) + +async def process_one(problem, solution, idx, results_list): + # Stage 1: o3 slot discovery + slot_text = await api_call( + [{"role": "system", "content": SLOT_DISCOVERY_O3}, + {"role": "user", "content": f"Problem:\n{problem}\n\nSolution:\n{solution}"}], + model="o3", max_tokens=2000) + slots = extract_json(slot_text) if slot_text else None + if not slots or not slots.get('mutable_slots'): + await save({'status': 'no_slots', 'idx': idx}, results_list) + print(f"[{idx}] no_slots (o3)") + return + + n = len(slots['mutable_slots']) + + # Stage 2: o3 back-synthesis + synth_text = await api_call( + [{"role": "system", "content": BACK_SYNTHESIS}, + {"role": "user", "content": 
f"Original:\n{problem}\n\nSolution:\n{solution}\n\nSlots:\n{json.dumps(slots['mutable_slots'])}\n\nCreate variant."}], + model="o3", max_tokens=6000) + synth = extract_json(synth_text) if synth_text else None + if not synth or not synth.get('new_problem'): + await save({'status': 'error', 'idx': idx, 'reason': 'synthesis failed'}, results_list) + print(f"[{idx}] synth_error") + return + + new_p, new_s, new_a = synth['new_problem'], synth['new_solution'], synth.get('new_answer', '') + + # Stage 3: 3 judges + 3 repair rounds + for rr in range(4): + judges = await asyncio.gather(*[api_call( + [{"role": "system", "content": VERIFY}, + {"role": "user", "content": f"Problem:\n{new_p}\n\nSolution:\n{new_s}"}], + model="o3", max_tokens=500) for _ in range(3)]) + accepts = sum(1 for j in judges if j and 'ACCEPT' in j.upper() and 'REJECT' not in j.upper()) + if accepts == 3: + await save({ + 'status': 'accepted', 'idx': idx, + 'original_problem': problem, 'original_solution': solution, + 'kv_problem': new_p, 'kv_solution': new_s, 'kv_answer': new_a, + 'mutable_slots': slots['mutable_slots'], + 'slot_changes': synth.get('slot_changes', []), + 'repair_rounds': rr, 'n_slots': n, + }, results_list) + print(f"[{idx}] ACCEPTED (round {rr}, {n} slots)") + return + if rr < 3: + reasons = [re.search(r'REASON:\s*(.*)', j or '', re.I) for j in judges] + reason_str = '; '.join(m.group(1)[:200] for m in reasons if m)[:500] + fix = await api_call( + [{"role": "system", "content": REPAIR.format(problem=new_p, solution=new_s, reason=reason_str)}, + {"role": "user", "content": "Fix."}], + model="o3", max_tokens=6000) + fd = extract_json(fix) if fix else None + if fd: + new_p = fd.get('new_problem', new_p) + new_s = fd.get('new_solution', new_s) + new_a = fd.get('new_answer', new_a) + + await save({'status': 'rejected', 'idx': idx}, results_list) + print(f"[{idx}] REJECTED") + +def extract_boxed(text): + if not text: return None + matches = [] + i = 0 + while i < len(text): + idx = 
text.find('\\boxed{', i) + if idx == -1: break + depth = 1; j = idx + 7 + while j < len(text) and depth > 0: + if text[j] == '{': depth += 1 + elif text[j] == '}': depth -= 1 + j += 1 + if depth == 0: matches.append(text[idx+7:j-1].strip()) + i = j + return matches[-1] if matches else None + +async def evaluate_all(all_accepted): + async def solve(problem, model): + async with SEM_EVAL: + r = await client.chat.completions.create( + model=model, temperature=0, max_tokens=2048, + messages=[{"role": "system", "content": "Solve step by step. Final answer in \\boxed{}."}, + {"role": "user", "content": problem}]) + return r.choices[0].message.content + + async def grade(ref, stu): + async with SEM_EVAL: + r = await client.chat.completions.create( + model="gpt-4o", temperature=0, max_tokens=10, + messages=[{"role": "user", "content": f"Are these equivalent? Ref: {ref}\nStudent: {stu}\nCORRECT or INCORRECT."}]) + t = r.choices[0].message.content.upper() + return 'INCORRECT' not in t and 'CORRECT' in t + + results = {} + for model in ['gpt-4o', 'gpt-4o-mini']: + print(f"\nEval {len(all_accepted)} with {model}...") + orig_sols = await asyncio.gather(*[solve(a['original_problem'], model) for a in all_accepted]) + kv_sols = await asyncio.gather(*[solve(a['kv_problem'], model) for a in all_accepted]) + + og, kg = [], [] + for i, a in enumerate(all_accepted): + ro = extract_boxed(a['original_solution']); so = extract_boxed(orig_sols[i]) + rk = a.get('kv_answer') or extract_boxed(a.get('kv_solution','')); sk = extract_boxed(kv_sols[i]) + og.append(await grade(ro or 'N/A', so or 'N/A') if ro and so else False) + kg.append(await grade(rk or 'N/A', sk or 'N/A') if rk and sk else False) + + oa = sum(og)/len(og)*100; ka = sum(kg)/len(kg)*100 + results[model] = {'orig': oa, 'kv': ka, 'delta': ka-oa, 'n': len(all_accepted), + 'orig_c': sum(og), 'kv_c': sum(kg)} + print(f" {model}: orig={oa:.1f}% kv={ka:.1f}% Δ={ka-oa:+.1f}pp (n={len(all_accepted)})") + return results + +async def 
main(): + # Load all Level 5 problems (same seed as kv_math_200.py) + subsets = ['algebra', 'number_theory', 'precalculus', 'intermediate_algebra', 'counting_and_probability', 'geometry'] + all_l5 = [] + for s in subsets: + ds = load_dataset('EleutherAI/hendrycks_math', s, split='test') + for item in ds: + if item.get('level') == 'Level 5' and len(item.get('solution','')) > 50: + item['subject'] = s; all_l5.append(item) + random.shuffle(all_l5) + selected = all_l5[:200] + print(f"Total pool: {len(selected)} Level 5 problems") + + # Load previous kv_200 results to find no_slots indices + with open(os.path.join(OUTPUT_DIR, 'kv_generation.json')) as f: + prev = json.load(f) + no_slots_indices = [r['original_index'] for r in prev if r['status'] == 'no_slots'] + prev_accepted = [r for r in prev if r['status'] == 'accepted'] + print(f"Previous: {len(prev_accepted)} accepted, {len(no_slots_indices)} no_slots to redo with o3") + + # Also load kv_50 accepted + kv50_file = '/home/yurenh2/gap/mini_gap_math_results/kv_50/kv_final_results.json' + kv50_accepted = [] + if os.path.exists(kv50_file): + with open(kv50_file) as f: + kv50 = json.load(f) + kv50_accepted = kv50.get('accepted_variants', []) + print(f"kv_50 accepted: {len(kv50_accepted)}") + + # Resume redo progress + redo_results = [] + done_indices = set() + if os.path.exists(REDO_FILE): + with open(REDO_FILE) as f: + redo_results = json.load(f) + done_indices = {r['idx'] for r in redo_results} + print(f"Resuming redo: {len(redo_results)} done") + + remaining = [i for i in no_slots_indices if i not in done_indices] + print(f"Remaining to redo: {len(remaining)}") + + # Process in batches of 8 + for batch_start in range(0, len(remaining), 8): + batch = remaining[batch_start:batch_start+8] + tasks = [process_one(selected[i]['problem'], selected[i]['solution'], i, redo_results) for i in batch] + await asyncio.gather(*tasks) + from collections import Counter + st = Counter(r['status'] for r in redo_results) + print(f"--- 
Redo progress: {len(redo_results)}/{len(no_slots_indices)}, {dict(st)} ---") + + # Combine all accepted + redo_accepted = [r for r in redo_results if r['status'] == 'accepted'] + all_accepted = kv50_accepted + prev_accepted + redo_accepted + print(f"\nTotal accepted: {len(all_accepted)} (kv50={len(kv50_accepted)}, kv200={len(prev_accepted)}, redo={len(redo_accepted)})") + + # Evaluate + if all_accepted: + eval_results = await evaluate_all(all_accepted) + final = { + 'total_accepted': len(all_accepted), + 'sources': {'kv50': len(kv50_accepted), 'kv200': len(prev_accepted), 'redo': len(redo_accepted)}, + 'evaluation': eval_results, + } + with open(os.path.join(OUTPUT_DIR, 'kv_combined_final.json'), 'w') as f: + json.dump(final, f, indent=2) + print(f"\n{'='*60}\nFINAL COMBINED RESULTS ({len(all_accepted)} KV variants)\n{'='*60}") + for m, r in eval_results.items(): + print(f" {m}: orig={r['orig']:.1f}% kv={r['kv']:.1f}% Δ={r['delta']:+.1f}pp (n={r['n']})") + +asyncio.run(main()) diff --git a/mini_gap_math.py b/mini_gap_math.py new file mode 100644 index 0000000..ff95727 --- /dev/null +++ b/mini_gap_math.py @@ -0,0 +1,397 @@ +#!/usr/bin/env python3 +""" +Mini-GAP-MATH: Apply GAP surface renaming to MATH dataset and evaluate. +Proves GAP framework generalizes beyond Putnam. 
+""" + +import json +import re +import random +import os +import sys +import time +import argparse +from pathlib import Path + +random.seed(42) + +# ============================================================ +# Step 1: Extract variables from MATH problems +# ============================================================ + +def extract_latex_variables(problem_text): + """Extract single-letter and short math variables from LaTeX.""" + # Find variables inside $...$ math mode + math_segments = re.findall(r'\$([^$]+)\$', problem_text) + all_text = ' '.join(math_segments) + + # Common math variables: single letters, subscripted versions + vars_found = set() + + # Single-letter variables (a-z, A-Z) used standalone in math + for m in re.finditer(r'(?<![a-zA-Z\\])([a-zA-Z])(?![a-zA-Z{])', all_text): + v = m.group(1) + # Exclude common function names and constants + if v not in {'e', 'i', 'd', 'f', 'g', 'h', 'sin', 'cos', 'tan', 'log', 'ln', 'lim', 'max', 'min'}: + vars_found.add(v) + + # Subscripted variables like x_1, a_n + for m in re.finditer(r'([a-zA-Z])_\{?([a-zA-Z0-9]+)\}?', all_text): + vars_found.add(f"{m.group(1)}_{m.group(2)}") + + return list(vars_found) + +# ============================================================ +# Step 2: Surface renaming - Garbled String (GS) variant +# ============================================================ + +def generate_garbled_name(length=None): + """Generate a random alphanumeric string (4-12 chars).""" + if length is None: + length = random.randint(4, 12) + chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' + return ''.join(random.choices(chars, k=length)) + +def generate_descriptive_long_name(var_name): + """Generate a descriptive long confusing name (DLC).""" + # Pool of unrelated words + words = [ + 'marshmallow', 'butterfly', 'telescope', 'pineapple', 'volcano', + 'watermelon', 'dinosaur', 'moonlight', 'umbrella', 'strawberry', + 'caterpillar', 'sunflower', 'kangaroo', 'chocolate', 'thunderbolt', + 
'penguin', 'trampoline', 'avalanche', 'cinnamon', 'dragonfly', + 'elephant', 'fireworks', 'giraffe', 'honeybee', 'igloo', + 'jellyfish', 'kaleidoscope', 'lighthouse', 'mandarin', 'nutmeg', + 'origami', 'platypus', 'quicksilver', 'rainbow', 'saxophone', + 'tumbleweed', 'unicorn', 'velvet', 'whirlpool', 'xylophone' + ] + n_words = random.randint(2, 4) + return ''.join(random.sample(words, n_words)) + +def apply_surface_rename(problem_text, solution_text, var_map): + """Apply variable renaming to both problem and solution.""" + new_problem = problem_text + new_solution = solution_text + + # Sort by length (longest first) to avoid partial replacements + sorted_vars = sorted(var_map.keys(), key=len, reverse=True) + + for old_var, new_var in [(v, var_map[v]) for v in sorted_vars]: + # Handle subscripted variables + if '_' in old_var: + base, sub = old_var.split('_', 1) + # Replace in LaTeX: x_{1} or x_1 + patterns = [ + (rf'(?<![a-zA-Z]){re.escape(base)}_\{{{re.escape(sub)}\}}', f'{new_var}'), + (rf'(?<![a-zA-Z]){re.escape(base)}_{re.escape(sub)}(?![a-zA-Z0-9])', f'{new_var}'), + ] + for pat, repl in patterns: + new_problem = re.sub(pat, repl, new_problem) + new_solution = re.sub(pat, repl, new_solution) + else: + # Single letter variable - be careful with context + # Replace inside math mode ($...$) only + def replace_in_math(text, old, new): + def replacer(match): + content = match.group(1) + # Replace standalone variable + content = re.sub( + rf'(?<![a-zA-Z\\]){re.escape(old)}(?![a-zA-Z])', + new, content + ) + return f'${content}$' + return re.sub(r'\$([^$]+)\$', replacer, text) + + new_problem = replace_in_math(new_problem, old_var, new_var) + new_solution = replace_in_math(new_solution, old_var, new_var) + + return new_problem, new_solution + +def create_variants(problems): + """Create GS and DLC variants for each problem.""" + results = [] + for idx, prob in enumerate(problems): + variables = extract_latex_variables(prob['problem']) + if len(variables) == 0: + # 
No variables to rename, skip + continue + + # Create variable mappings + used_gs = set() + used_dlc = set() + gs_map = {} + dlc_map = {} + + for v in variables: + # Garbled String + gs_name = generate_garbled_name() + while gs_name in used_gs: + gs_name = generate_garbled_name() + used_gs.add(gs_name) + gs_map[v] = gs_name + + # Descriptive Long Confusing + dlc_name = generate_descriptive_long_name(v) + while dlc_name in used_dlc: + dlc_name = generate_descriptive_long_name(v) + used_dlc.add(dlc_name) + dlc_map[v] = dlc_name + + gs_problem, gs_solution = apply_surface_rename( + prob['problem'], prob['solution'], gs_map + ) + dlc_problem, dlc_solution = apply_surface_rename( + prob['problem'], prob['solution'], dlc_map + ) + + results.append({ + 'index': idx, + 'subject': prob.get('subject', 'unknown'), + 'level': prob.get('level', 'unknown'), + 'original': { + 'problem': prob['problem'], + 'solution': prob['solution'], + }, + 'garbled_string': { + 'problem': gs_problem, + 'solution': gs_solution, + 'map': gs_map, + }, + 'descriptive_long_confusing': { + 'problem': dlc_problem, + 'solution': dlc_solution, + 'map': dlc_map, + }, + 'variables': variables, + }) + + return results + + +# ============================================================ +# Step 3: Evaluation with local models +# ============================================================ + +def extract_boxed_answer(text): + """Extract answer from \\boxed{...} in MATH-style solutions.""" + # Find the last \boxed{...} + matches = re.findall(r'\\boxed\{([^}]*(?:\{[^}]*\}[^}]*)*)\}', text) + if matches: + return matches[-1].strip() + return None + +def normalize_answer(ans): + """Normalize answer for comparison.""" + if ans is None: + return None + ans = ans.strip() + # Remove \$ signs + ans = ans.replace('$', '') + # Remove spaces + ans = ans.replace(' ', '') + # Normalize fractions + ans = ans.replace('\\dfrac', '\\frac') + ans = ans.replace('\\tfrac', '\\frac') + return ans + +def check_answer(generated, 
reference_solution): + """Check if generated answer matches reference.""" + ref_answer = extract_boxed_answer(reference_solution) + gen_answer = extract_boxed_answer(generated) + + if ref_answer is None or gen_answer is None: + return False + + return normalize_answer(ref_answer) == normalize_answer(gen_answer) + + +def run_inference_batch(model, tokenizer, problems, device, batch_size=4): + """Run inference on a batch of problems.""" + import torch + + results = [] + for i in range(0, len(problems), batch_size): + batch = problems[i:i+batch_size] + prompts = [] + for p in batch: + messages = [ + {"role": "system", "content": "You are an expert mathematician. Solve the problem step by step and put your final answer in \\boxed{}."}, + {"role": "user", "content": p} + ] + try: + formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + except Exception: + formatted = f"Problem: {p}\n\nSolve step by step. Put final answer in \\boxed{{}}.\n\nSolution:" + prompts.append(formatted) + + inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=2048).to(device) + + with torch.no_grad(): + output_ids = model.generate( + **inputs, + max_new_tokens=512, + do_sample=False, + pad_token_id=tokenizer.pad_token_id, + ) + + generated = tokenizer.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True) + results.extend([g.strip() for g in generated]) + + if (i // batch_size) % 10 == 0: + print(f" Progress: {min(i+batch_size, len(problems))}/{len(problems)}") + + return results + + +def evaluate_model(model_name, variants_data, device="cuda:2", batch_size=4): + """Evaluate a single model on original + variants.""" + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer + + print(f"\n{'='*60}") + print(f"Loading model: {model_name}") + print(f"{'='*60}") + + tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left') + if tokenizer.pad_token_id is 
None: + tokenizer.pad_token_id = tokenizer.eos_token_id or 0 + + dtype = torch.float16 if 'cuda' in device else torch.float32 + model = AutoModelForCausalLM.from_pretrained( + model_name, device_map=device, torch_dtype=dtype + ) + model.eval() + + results = {'model': model_name, 'variants': {}} + + for variant_type in ['original', 'garbled_string', 'descriptive_long_confusing']: + print(f"\n--- Evaluating {variant_type} ---") + + problems = [item[variant_type]['problem'] for item in variants_data] + solutions = [item[variant_type]['solution'] for item in variants_data] + + generated = run_inference_batch(model, tokenizer, problems, device, batch_size) + + correct = 0 + total = len(problems) + per_item = [] + for j, (gen, sol) in enumerate(zip(generated, solutions)): + is_correct = check_answer(gen, sol) + correct += int(is_correct) + per_item.append({ + 'index': variants_data[j]['index'], + 'correct': is_correct, + 'generated_answer': extract_boxed_answer(gen), + 'reference_answer': extract_boxed_answer(sol), + }) + + acc = correct / total * 100 if total > 0 else 0 + results['variants'][variant_type] = { + 'accuracy': acc, + 'correct': correct, + 'total': total, + 'per_item': per_item, + } + print(f" {variant_type}: {correct}/{total} = {acc:.1f}%") + + # Compute deltas + orig_acc = results['variants']['original']['accuracy'] + for vt in ['garbled_string', 'descriptive_long_confusing']: + var_acc = results['variants'][vt]['accuracy'] + results['variants'][vt]['delta'] = var_acc - orig_acc + + # Cleanup + del model + del tokenizer + torch.cuda.empty_cache() + + return results + + +def main(): + parser = argparse.ArgumentParser(description='Mini-GAP-MATH experiment') + parser.add_argument('--step', choices=['prepare', 'evaluate', 'all'], default='all') + parser.add_argument('--models', nargs='+', default=['Qwen/Qwen2.5-7B-Instruct']) + parser.add_argument('--device', default='cuda:2') + parser.add_argument('--batch-size', type=int, default=4) + 
parser.add_argument('--max-problems', type=int, default=200) + parser.add_argument('--input', default='/home/yurenh2/gap/math_sample_200.json') + parser.add_argument('--output-dir', default='/home/yurenh2/gap/mini_gap_math_results') + args = parser.parse_args() + + os.makedirs(args.output_dir, exist_ok=True) + variants_file = os.path.join(args.output_dir, 'math_variants.json') + + if args.step in ['prepare', 'all']: + print("="*60) + print("Step 1: Loading MATH problems and creating variants") + print("="*60) + + with open(args.input) as f: + problems = json.load(f) + + problems = problems[:args.max_problems] + print(f"Loaded {len(problems)} problems") + + variants = create_variants(problems) + print(f"Created variants for {len(variants)} problems") + + with open(variants_file, 'w') as f: + json.dump(variants, f, indent=2) + print(f"Saved to {variants_file}") + + # Show a sample + if variants: + v = variants[0] + print(f"\nSample problem (original):") + print(f" {v['original']['problem'][:200]}...") + print(f" Variables: {v['variables']}") + print(f"\nGS variant:") + print(f" {v['garbled_string']['problem'][:200]}...") + print(f" Map: {v['garbled_string']['map']}") + + if args.step in ['evaluate', 'all']: + print("\n" + "="*60) + print("Step 2: Evaluating models") + print("="*60) + + with open(variants_file) as f: + variants_data = json.load(f) + + all_results = [] + for model_name in args.models: + try: + result = evaluate_model( + model_name, variants_data, + device=args.device, batch_size=args.batch_size + ) + all_results.append(result) + + # Save incrementally + out_file = os.path.join(args.output_dir, 'evaluation_results.json') + with open(out_file, 'w') as f: + json.dump(all_results, f, indent=2) + + except Exception as e: + print(f"ERROR with {model_name}: {e}") + import traceback + traceback.print_exc() + + # Print summary table + print("\n" + "="*60) + print("RESULTS SUMMARY") + print("="*60) + print(f"{'Model':<35} {'Original':>10} {'GS':>10} {'GS Δ':>8} 
{'DLC':>10} {'DLC Δ':>8}") + print("-"*85) + for r in all_results: + m = r['model'].split('/')[-1] + orig = r['variants']['original']['accuracy'] + gs = r['variants']['garbled_string']['accuracy'] + gs_d = r['variants']['garbled_string']['delta'] + dlc = r['variants']['descriptive_long_confusing']['accuracy'] + dlc_d = r['variants']['descriptive_long_confusing']['delta'] + print(f"{m:<35} {orig:>9.1f}% {gs:>9.1f}% {gs_d:>+7.1f} {dlc:>9.1f}% {dlc_d:>+7.1f}") + + +if __name__ == '__main__': + main() diff --git a/mini_gap_math_api.py b/mini_gap_math_api.py new file mode 100644 index 0000000..254db9c --- /dev/null +++ b/mini_gap_math_api.py @@ -0,0 +1,192 @@ +#!/usr/bin/env python3 +""" +Mini-GAP-MATH: Evaluate MATH variants using OpenAI API. +""" + +import json +import re +import os +import sys +import asyncio +import time +import argparse +from pathlib import Path +from openai import AsyncOpenAI + +client = AsyncOpenAI() +SEMAPHORE = asyncio.Semaphore(50) # max concurrent requests + +# ============================================================ +# Answer extraction and checking +# ============================================================ + +def extract_boxed_answer(text): + """Extract answer from \\boxed{...}.""" + if not text: + return None + # Handle nested braces + matches = [] + i = 0 + while i < len(text): + idx = text.find('\\boxed{', i) + if idx == -1: + break + # Find matching closing brace + depth = 1 + j = idx + 7 + while j < len(text) and depth > 0: + if text[j] == '{': + depth += 1 + elif text[j] == '}': + depth -= 1 + j += 1 + if depth == 0: + matches.append(text[idx+7:j-1].strip()) + i = j + return matches[-1] if matches else None + +def normalize_answer(ans): + """Normalize answer for comparison.""" + if ans is None: + return None + ans = ans.strip() + ans = ans.replace('$', '').replace(' ', '') + ans = ans.replace('\\dfrac', '\\frac').replace('\\tfrac', '\\frac') + ans = ans.replace('\\left', '').replace('\\right', '') + ans = ans.replace('\\,', 
'').replace('\\;', '') + return ans + +def check_answer(generated, reference_solution): + """Check if generated answer matches reference.""" + ref_answer = extract_boxed_answer(reference_solution) + gen_answer = extract_boxed_answer(generated) + if ref_answer is None or gen_answer is None: + return False + return normalize_answer(ref_answer) == normalize_answer(gen_answer) + +# ============================================================ +# API calls +# ============================================================ + +async def solve_problem(problem_text, model="gpt-4o-mini"): + """Solve a single problem using OpenAI API.""" + async with SEMAPHORE: + try: + resp = await client.chat.completions.create( + model=model, + messages=[ + {"role": "system", "content": "You are an expert mathematician. Solve the problem step by step and put your final answer in \\boxed{}."}, + {"role": "user", "content": problem_text} + ], + max_tokens=2048, + temperature=0, + ) + return resp.choices[0].message.content + except Exception as e: + print(f" API error: {e}") + return None + +async def evaluate_variant(variant_data, variant_type, model): + """Evaluate all problems for one variant type.""" + problems = [item[variant_type]['problem'] for item in variant_data] + solutions = [item[variant_type]['solution'] for item in variant_data] + + print(f"\n--- Evaluating {variant_type} ({len(problems)} problems) ---") + + # Launch all requests concurrently + tasks = [solve_problem(p, model) for p in problems] + generated = await asyncio.gather(*tasks) + + correct = 0 + total = len(problems) + per_item = [] + for j, (gen, sol) in enumerate(zip(generated, solutions)): + is_correct = check_answer(gen or "", sol) + correct += int(is_correct) + per_item.append({ + 'index': variant_data[j]['index'], + 'correct': is_correct, + 'generated_answer': extract_boxed_answer(gen or ""), + 'reference_answer': extract_boxed_answer(sol), + }) + + acc = correct / total * 100 if total > 0 else 0 + print(f" 
{variant_type}: {correct}/{total} = {acc:.1f}%") + + return { + 'accuracy': acc, + 'correct': correct, + 'total': total, + 'per_item': per_item, + } + +async def evaluate_model(model, variant_data, output_dir): + """Evaluate a model on all variants.""" + print(f"\n{'='*60}") + print(f"Evaluating model: {model}") + print(f"{'='*60}") + + results = {'model': model, 'variants': {}} + + for vt in ['original', 'garbled_string', 'descriptive_long_confusing']: + results['variants'][vt] = await evaluate_variant(variant_data, vt, model) + + # Compute deltas + orig_acc = results['variants']['original']['accuracy'] + for vt in ['garbled_string', 'descriptive_long_confusing']: + results['variants'][vt]['delta'] = results['variants'][vt]['accuracy'] - orig_acc + + # Save + out_file = os.path.join(output_dir, f'{model.replace("/", "_")}_results.json') + with open(out_file, 'w') as f: + json.dump(results, f, indent=2) + print(f" Saved to {out_file}") + + return results + +async def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--models', nargs='+', default=['gpt-4o-mini']) + parser.add_argument('--variants-file', default='/home/yurenh2/gap/mini_gap_math_results/math_variants.json') + parser.add_argument('--output-dir', default='/home/yurenh2/gap/mini_gap_math_results') + parser.add_argument('--concurrency', type=int, default=50) + args = parser.parse_args() + + global SEMAPHORE + SEMAPHORE = asyncio.Semaphore(args.concurrency) + + os.makedirs(args.output_dir, exist_ok=True) + + with open(args.variants_file) as f: + variant_data = json.load(f) + + print(f"Loaded {len(variant_data)} problems with variants") + + all_results = [] + for model in args.models: + result = await evaluate_model(model, variant_data, args.output_dir) + all_results.append(result) + + # Print summary + print("\n" + "="*80) + print("MINI-GAP-MATH RESULTS SUMMARY") + print("="*80) + print(f"{'Model':<25} {'Original':>10} {'GS':>10} {'GS Δ':>8} {'DLC':>10} {'DLC Δ':>8}") + print("-"*75) + 
for r in all_results: + m = r['model'] + orig = r['variants']['original']['accuracy'] + gs = r['variants']['garbled_string']['accuracy'] + gs_d = r['variants']['garbled_string']['delta'] + dlc = r['variants']['descriptive_long_confusing']['accuracy'] + dlc_d = r['variants']['descriptive_long_confusing']['delta'] + print(f"{m:<25} {orig:>9.1f}% {gs:>9.1f}% {gs_d:>+7.1f} {dlc:>9.1f}% {dlc_d:>+7.1f}") + + # Save combined + combined_file = os.path.join(args.output_dir, 'all_api_results.json') + with open(combined_file, 'w') as f: + json.dump(all_results, f, indent=2) + print(f"\nAll results saved to {combined_file}") + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/mini_gap_math_regrade.py b/mini_gap_math_regrade.py new file mode 100644 index 0000000..f0ba12f --- /dev/null +++ b/mini_gap_math_regrade.py @@ -0,0 +1,166 @@ +#!/usr/bin/env python3 +""" +Multi-grader consistency analysis for Mini-GAP-MATH. +Uses multiple LLM graders to evaluate the same (problem, solution) pairs. +Computes Cohen's kappa and percent agreement. +""" + +import json +import os +import asyncio +import argparse +from openai import AsyncOpenAI + +client = AsyncOpenAI() +SEMAPHORE = asyncio.Semaphore(30) + +GRADING_PROMPT = """You are a strict math grader. You are given a math problem, its reference solution, and a student's solution. + +Determine if the student's final answer is CORRECT or INCORRECT. +- For numerical answers: the answer must match exactly (after simplification). +- For expressions: must be mathematically equivalent. +- Ignore intermediate steps; focus only on the final answer. 
+ +Respond with EXACTLY one word: CORRECT or INCORRECT""" + + +async def grade_one(problem, reference_solution, student_solution, model): + """Grade a single (problem, student_solution) pair.""" + async with SEMAPHORE: + try: + resp = await client.chat.completions.create( + model=model, + messages=[ + {"role": "system", "content": GRADING_PROMPT}, + {"role": "user", "content": f"Problem:\n{problem}\n\nReference Solution:\n{reference_solution}\n\nStudent Solution:\n{student_solution}"} + ], + max_tokens=10, + temperature=0, + ) + answer = resp.choices[0].message.content.strip().upper() + # 'INCORRECT' contains 'CORRECT' as a substring, so rule it out explicitly + return 'INCORRECT' not in answer and 'CORRECT' in answer + except Exception as e: + print(f" Grading error: {e}") + return None + + +async def grade_all(problems, ref_solutions, student_solutions, model): + """Grade all solutions with a given model.""" + tasks = [ + grade_one(p, r, s, model) + for p, r, s in zip(problems, ref_solutions, student_solutions) + ] + return await asyncio.gather(*tasks) + + +def cohens_kappa(labels1, labels2): + """Compute Cohen's kappa between two sets of binary labels.""" + assert len(labels1) == len(labels2) + # Filter out None entries (failed grading calls) + valid = [(a, b) for a, b in zip(labels1, labels2) if a is not None and b is not None] + if not valid: + return 0.0, 0 + n = len(valid) + agree = sum(1 for a, b in valid if a == b) + p_o = agree / n # observed agreement + + # Expected agreement from the graders' marginal rates + p1_yes = sum(1 for a, _ in valid if a) / n + p2_yes = sum(1 for _, b in valid if b) / n + p_e = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes) + + if p_e == 1.0: + return 1.0, n + kappa = (p_o - p_e) / (1 - p_e) + return kappa, n + + +async def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--results-file', required=True, help='Path to evaluation results JSON') + parser.add_argument('--variants-file', required=True, help='Path to math_variants.json') + parser.add_argument('--grader-models', nargs='+', default=['gpt-4o', 'gpt-4o-mini']) + parser.add_argument('--variant-type', 
default='original') + parser.add_argument('--output', default='/home/yurenh2/gap/mini_gap_math_results/regrade_consistency.json') + args = parser.parse_args() + + # Load original eval results + with open(args.results_file) as f: + eval_data = json.load(f) + + # Handle both list and single-model format + if isinstance(eval_data, list): + eval_results = eval_data[0] + else: + eval_results = eval_data + + # Load variant data for problems + with open(args.variants_file) as f: + variants = json.load(f) + + variant_type = args.variant_type + # Get problems and reference solutions + problems = [v[variant_type]['problem'] for v in variants] + ref_solutions = [v[variant_type]['solution'] for v in variants] + + # The eval results store only the extracted final answers, not the raw + # generated text, so the students' full solutions cannot be re-graded here. + + print(f"Loaded {len(variants)} problems") + print(f"Grader models: {args.grader_models}") + print(f"Variant type: {variant_type}") + 
+ + # Simpler approach: Grade the reference solutions with each model to check + # if graders are consistent on "obviously correct" answers + + all_grades = {} + for model in args.grader_models: + print(f"\n--- Grading with {model} ---") + grades = await grade_all(problems, ref_solutions, ref_solutions, model) + all_grades[model] = grades + correct_count = sum(1 for g in grades if g is True) + none_count = sum(1 for g in grades if g is None) + print(f" {model}: {correct_count}/{len(grades)} correct, {none_count} errors") + + # Compute pairwise kappa + models = list(all_grades.keys()) + print("\n" + "="*60) + print("PAIRWISE COHEN'S KAPPA") + print("="*60) + + results = {'models': models, 'kappas': {}, 'agreement': {}} + + for i in range(len(models)): + for j in range(i+1, len(models)): + m1, m2 = models[i], models[j] + kappa, n = cohens_kappa(all_grades[m1], all_grades[m2]) + pct_agree = sum(1 for a, b in zip(all_grades[m1], all_grades[m2]) + if a is not None and b is not None and a == b) + total_valid = sum(1 for a, b in zip(all_grades[m1], all_grades[m2]) + if a is not None and b is not None) + pct = pct_agree / total_valid * 100 if total_valid > 0 else 0 + + key = f"{m1}_vs_{m2}" + results['kappas'][key] = kappa + results['agreement'][key] = pct + print(f" {m1} vs {m2}: κ={kappa:.3f}, agreement={pct:.1f}% (n={n})") + + # Save + with open(args.output, 'w') as f: + json.dump(results, f, indent=2) + print(f"\nSaved to {args.output}") + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/putnam-bench-anon/.gitignore b/putnam-bench-anon/.gitignore new file mode 100644 index 0000000..27c50b5 --- /dev/null +++ b/putnam-bench-anon/.gitignore @@ -0,0 +1,214 @@ +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[cod] +*$py.class + +# C extensions +*.so + +# Distribution / packaging +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +share/python-wheels/ +*.egg-info/ +.installed.cfg +*.egg 
+MANIFEST + +# PyInstaller +# Usually these files are written by a python script from a template +# before PyInstaller builds the exe, so as to inject date/other infos into it. +*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +*.py,cover +.hypothesis/ +.pytest_cache/ +cover/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 +db.sqlite3-journal + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +.pybuilder/ +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# IPython +profile_default/ +ipython_config.py + +# pyenv +# For a library or package, you might want to ignore these files since the code is +# intended to run in multiple environments; otherwise, check them in: +# .python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don't work, or not +# install all needed dependencies. +#Pipfile.lock + +# poetry +# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. +# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control +#poetry.lock + +# pdm +# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. +#pdm.lock +# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it +# in version control. +# https://pdm.fming.dev/#use-with-ide +.pdm.toml + +# PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm +__pypackages__/ + +# Celery stuff +celerybeat-schedule +celerybeat.pid + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ + +# pytype static type analyzer +.pytype/ + +# Cython debug symbols +cython_debug/ + +# PyCharm +# JetBrains specific template is maintained in a separate JetBrains.gitignore that can +# be added to the global gitignore or merged into this project gitignore. For a PyCharm +# project, it is not recommended to check the API keys and other secrets into version control. +.idea/ + +# VS Code +.vscode/ + +# HuggingFace cache and models +.cache/ +models/ +transformers_cache/ +huggingface_cache/ + +# CUDA cache +*.cu +*.cubin + +# Model checkpoints and outputs +checkpoints/ +outputs/ +results/ +logs/ + +# API keys and sensitive configuration +.env +*.key +api_keys.txt +secrets.json + +# Configuration files with sensitive data +*config*.json +.putnam_config.json +.putnam_env + +# Temporary files +*.tmp +*.temp +.DS_Store +Thumbs.db + +# Windows Zone Identifier files +*:Zone.Identifier + +# Data and experiment files +data/ +experiments/ +test_results/ + +# Large files +*.bin +*.safetensors +*.h5 +*.pt +*.pth + +# Virtual environments +putnam-local/ +putnam-env/
\ No newline at end of file diff --git a/putnam-bench-anon/README.md b/putnam-bench-anon/README.md new file mode 100644 index 0000000..8010895 --- /dev/null +++ b/putnam-bench-anon/README.md @@ -0,0 +1,630 @@ +# Putnam Mathematical Problem Solver + +A comprehensive system for evaluating AI models on mathematical problem solving using the Putnam Competition dataset. This project provides a unified interface for testing multiple AI providers (cloud and local) on complex mathematical problems with automated grading. + +## Features + +- **Multi-Provider Support**: 8 AI providers including OpenAI, Anthropic, Google Gemini, xAI, Kimi, VLLM, VLLM Direct, and HuggingFace +- **Dynamic Model Selection**: Runtime model configuration for optimal cost/performance +- **Robust Evaluation**: Specialized prompts for mathematical problem solving and grading +- **Local Model Support**: Run models locally via VLLM server or direct HuggingFace inference +- **GPU Optimization**: Tested on RTX 3060 with CUDA support +- **Cost Estimation**: Built-in cost calculation for different providers +- **Async Processing**: Efficient handling of large datasets +- **Error Recovery**: Intelligent retry logic and JSON parsing fallbacks +- **Unified CLI**: Single command-line interface for all operations +- **Progress Tracking**: Real-time progress bars with success/failure statistics +- **Multi-Variant Testing**: Test all 6 problem variants with a single command + +## Architecture + +The project follows a clean, modular architecture: + +``` +putnam-bench-anon/ +├── putnam_cli.py # 🎯 Main CLI interface (your primary entry point) +├── loader/ # 🔧 Core evaluation engine +│ ├── __init__.py +│ ├── providers/ # AI provider implementations +│ └── utils/ # Utility functions +├── scripts/ # 📋 Internal scripts (used by CLI) +│ ├── batch_evaluate.py # Batch evaluation logic +│ ├── health_check.py # Health checking utilities +│ ├── benchmark.py # Performance benchmarking +│ └── compare_*.py # Analysis scripts 
+├── dataset/ # 📚 Problem dataset +├── results/ # 📊 Evaluation results +└── requirements*.txt # 📦 Dependency management +``` + +**Key principle**: Use `putnam_cli.py` for all operations - it provides a clean, consistent interface to all functionality. + +## Installation + +### Quick Start (Automated) + +The easiest way to get started is using our automated installation script: + +```bash +# Clone the repository +git clone <repository-url> +cd putnam-bench-anon + +# One-command setup (installs everything and configures) +python install.py + +# Or quick install (core packages only) +python install.py --quick +``` + +### Manual Installation + +#### Environment Setup + +We recommend using a dedicated conda environment to avoid dependency conflicts: + +```bash +# Create dedicated environment +conda create -n putnam-local python=3.10 -y +conda activate putnam-local + +# Clone the repository +git clone <repository-url> +cd putnam-bench-anon +``` + +#### Choose Your Installation Type + +**Option 1: Full Installation (recommended)** +```bash +# Install all dependencies including local model support +pip install -r requirements.txt + +# Install PyTorch with CUDA support (for local models) +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 +``` + +**Quick tqdm install for progress bars:** +```bash +pip install tqdm +``` + +**Option 2: Minimal Installation (cloud providers only)** +```bash +# Install only core dependencies for cloud providers +pip install -r requirements-minimal.txt +``` + +**Option 3: Local Models Only** +```bash +# Install dependencies for local model inference +pip install -r requirements-local.txt + +# Install PyTorch with CUDA support +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 +``` + +**Option 4: Custom Installation** +```bash +# Install core packages manually +pip install openai anthropic google-generativeai transformers accelerate six + +# Optional: For local 
models +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 + +# Optional: For VLLM (see performance notes below) +pip install vllm +``` + +### GPU Requirements + +- **Recommended**: RTX 3060+ with 6GB+ VRAM +- **Minimum**: Any CUDA-compatible GPU with 4GB+ VRAM +- **CPU fallback**: Supported but not recommended for performance + +### Requirements Files Explained + +The project includes several requirements files for different use cases: + +- **`requirements.txt`**: Complete installation with all features (recommended) + - Includes cloud providers, local models, data analysis tools + - Best for full functionality and development + +- **`requirements-minimal.txt`**: Minimal installation for cloud providers only + - OpenAI, Anthropic, Google Gemini support + - Smallest footprint, fastest installation + - Best for cloud-only usage + +- **`requirements-local.txt`**: Specialized for local model inference + - HuggingFace transformers, VLLM, GPU utilities + - Includes model optimization tools + - Best for privacy-focused or offline usage + +Choose the requirements file that matches your intended usage pattern. 
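Whichever requirements file you choose, a quick import probe confirms which optional SDKs actually landed in the environment. The sketch below uses only the standard library; the package names are taken from the installation options above and are otherwise assumptions:

```python
import importlib.util

def provider_status(packages):
    """Return {package: bool} - whether each top-level package is importable."""
    return {p: importlib.util.find_spec(p.split(".")[0]) is not None for p in packages}

if __name__ == "__main__":
    # Package names taken from the requirements options above
    for pkg, ok in provider_status(["openai", "anthropic", "transformers", "torch", "vllm"]).items():
        print(f"{pkg:>14}: {'available' if ok else 'missing'}")
```

`find_spec` checks importability without actually importing the package, so this runs in under a second even when heavy packages like `torch` are installed.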
+ +## Quick Start + +### Basic Usage + +```python +import asyncio +from loader import create_loader +import json + +async def main(): + # Create a loader for any provider + loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="gpt-4o-mini") + + # Load a problem + with open("dataset/2000-A-1.json") as f: + problem = json.load(f) + + # Test the problem + result = await loader.test_single_problem(problem, variant_type="original") + print(f"Grade: {result['final_grade']}") + print(f"Solution: {result['solution']}") + +asyncio.run(main()) +``` + +### Command Line Usage + +```bash +# Activate environment first +conda activate putnam-local + +# Use the unified CLI for all operations +python putnam_cli.py info # Show system info +python putnam_cli.py health --provider openai +python putnam_cli.py test --provider openai +python putnam_cli.py solve dataset/2000-A-1.json --provider openai +python putnam_cli.py batch dataset/ --provider openai --max-files 10 +python putnam_cli.py multi-test --provider openai --max-files 50 --concurrent 100 +``` + +## Multi-Provider Support + +The system supports **8 AI providers** with a unified interface: + +### Cloud Providers + +#### 1. **OpenAI** +- **Models**: GPT-4o, GPT-4o-mini, o1, o3 +- **Best for**: High-quality solutions and grading +- **Setup**: Requires `OPENAI_API_KEY` environment variable + +#### 2. **Anthropic** +- **Models**: Claude-3.5-Sonnet, Claude-3.5-Haiku +- **Best for**: Detailed reasoning and explanation +- **Setup**: Requires `ANTHROPIC_API_KEY` environment variable + +#### 3. **Google Gemini** +- **Models**: Gemini-1.5-Pro, Gemini-1.5-Flash +- **Best for**: Cost-effective high-performance solving +- **Setup**: Requires `GOOGLE_API_KEY` environment variable + +#### 4. **xAI** +- **Models**: Grok-3, Grok-2 +- **Best for**: Advanced reasoning and mathematical problem solving +- **Setup**: Requires `XAI_API_KEY` environment variable + +#### 5. 
**Kimi (Moonshot AI)** +- **Models**: moonshot-v1-8k, moonshot-v1-32k, moonshot-v1-128k +- **Best for**: Chinese and English mathematical problem solving +- **Setup**: Requires `MOONSHOT_API_KEY` environment variable + +### Local Providers + +#### 6. **HuggingFace** +- **Models**: Any HuggingFace model (tested: GPT-2, DialoGPT) +- **Performance**: Fast loading (~40s first time, then cached) +- **Cost**: Free after setup +- **Privacy**: Complete local inference +- **Best for**: Development, testing, cost-sensitive applications + +#### 7. **VLLM Server** +- **Models**: Any VLLM-compatible model +- **Best for**: Multi-user server deployment +- **Setup**: Requires running separate VLLM server +- **Performance**: Good for sustained workloads + +#### 8. **VLLM Direct** ⚠️ +- **Models**: Any VLLM-compatible model +- **Performance**: **NOT RECOMMENDED** - 72+ second initialization +- **Issue**: Extremely slow first load due to graph compilation +- **Use case**: Only for research/benchmarking VLLM internals + +```python +# NOT RECOMMENDED for interactive use +loader = create_loader("vllm_direct", solver_model="gpt2") # Takes 72+ seconds! 
+``` + +## Performance Test Results + +Based on testing with RTX 3060 Laptop GPU (6GB VRAM): + +| Provider | First Load | Subsequent | GPU Memory | Recommendation | +|----------|------------|-----------|------------|----------------| +| API Clients | 1-2s | 1-2s | 0GB (cloud) | ⭐ **Best** | +| HuggingFace | ~40s | <1s | 0.27GB | ⭐ **Excellent** | +| VLLM Server | ~30s | <1s | Variable | ✅ Good | +| VLLM Direct | **72+s** | <1s | 0.24GB | ❌ **Avoid** | + +## Configuration Examples + +### HuggingFace Local Setup + +```python +# GPU inference (recommended) +loader = create_loader( + "huggingface", + solver_model="gpt2", + grader_model="gpt2", + device="cuda" +) + +# CPU inference (slower) +loader = create_loader( + "huggingface", + solver_model="gpt2", + grader_model="gpt2", + device="cpu" +) +``` + +### OpenAI Configuration + +```python +# High-quality setup +loader = create_loader( + "openai", + solver_model="gpt-4o-mini", + grader_model="o3" +) +``` + +### Kimi Configuration + +```python +# Standard 8k context window +loader = create_loader( + "kimi", + solver_model="moonshot-v1-8k", + grader_model="moonshot-v1-8k" +) + +# Large context window for complex problems +loader = create_loader( + "kimi", + solver_model="moonshot-v1-128k", + grader_model="moonshot-v1-128k" +) +``` + +### VLLM Server Setup + +```bash +# Start VLLM server first (separate terminal): +conda activate putnam-local +vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 +``` + +```python +# Then use the client +loader = create_loader( + "vllm", + solver_model="meta-llama/Llama-3.1-8B-Instruct", + base_url="http://localhost:8000/v1" +) +``` + +## Recommended Usage Patterns + +### For Development/Testing +```python +# Fast iteration with local models +loader = create_loader("huggingface", solver_model="gpt2", device="cuda") +``` + +### For Production/Important Results +```python +# High-quality cloud models +loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="o3") +loader = 
create_loader("anthropic", solver_model="claude-3-5-sonnet") +``` + +### For Privacy-Sensitive Work +```python +# Completely local inference +loader = create_loader("huggingface", solver_model="gpt2", device="cuda") +``` + +### For Cost Optimization +```python +# Free local models after setup +loader = create_loader("huggingface", solver_model="gpt2", device="cuda") + +# Or cost-effective cloud +loader = create_loader("openai", solver_model="gpt-4o-mini") +``` + +## Dataset Structure + +The dataset contains Putnam Competition problems in JSON format: + +```json +{ + "original": { + "problem_statement": "Mathematical problem...", + "solution": "Step-by-step solution...", + "problem_type": "proof" + }, + "descriptive_long": {...}, + "kernel_variant": {...} +} +``` + +### Problem Variants + +- **original**: The standard problem statement +- **descriptive_long**: More verbose problem description +- **descriptive_long_confusing**: Intentionally confusing wording +- **descriptive_long_misleading**: Misleading formulation +- **garbled_string**: Corrupted text version +- **kernel_variant**: Minimal essential formulation + +## Advanced Usage + +### GPU Memory Management + +```python +# Check GPU status +import torch +if torch.cuda.is_available(): + print(f"GPU: {torch.cuda.get_device_name(0)}") + print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB") +``` + +### Cost Estimation + +```python +# Estimate costs before processing +cost_info = await loader.estimate_cost( + num_problems=100, + avg_problem_length=1000, + avg_solution_length=2000 +) + +print(f"Estimated cost: ${cost_info['total_cost']:.2f}") +print(f"Cost per problem: ${cost_info['cost_per_problem']:.4f}") +``` + +### Batch Processing + +```python +async def process_dataset(dataset_path, provider="openai"): + loader = create_loader(provider) + + problems = [] + for file_path in Path(dataset_path).glob("*.json"): + with open(file_path) as f: + problems.append(json.load(f)) + + # 
Process problems concurrently + tasks = [ + loader.test_single_problem(problem, variant_type="original") + for problem in problems + ] + + results = await asyncio.gather(*tasks, return_exceptions=True) + return results +``` + +#### Incremental Saving and Resume Support + +The batch evaluation now supports incremental saving and resume functionality: + +```bash +# Start a new evaluation with incremental saving +python putnam_cli.py batch dataset/ --provider openai --output results/my_results.json + +# If interrupted, resume from checkpoint (no other arguments needed!) +python putnam_cli.py batch --resume results/checkpoint_my_results_20240101_120000.json + +# The checkpoint file is automatically created and updated after each problem +# It contains all progress and can be used to resume if the process is interrupted +``` + +**Features:** +- **Automatic Checkpointing**: Results are saved after each problem completes +- **Resume Support**: Continue from where you left off if interrupted +- **Backward Compatibility**: Existing checkpoint files continue to work +- **Atomic Saves**: Uses temporary files to ensure data integrity +- **Progress Tracking**: Shows saved/completed problems in real-time +- **Automatic Cleanup**: Checkpoint files are removed after successful completion + +**Resume Usage:** +```bash +# For new checkpoint files (created after this update) +python putnam_cli.py batch --resume checkpoint_file.json + +# For existing checkpoint files (created before this update) +python putnam_cli.py batch dataset/ --provider openai --resume old_checkpoint_file.json +``` + +The system automatically detects checkpoint format and handles both old and new formats seamlessly. 
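The "Atomic Saves" feature above boils down to a write-to-temp-then-rename pattern. A minimal sketch, assuming JSON checkpoints (`save_checkpoint_atomically` is an illustrative name, not the CLI's actual internal function):

```python
import json
import os
import tempfile

def save_checkpoint_atomically(results: dict, checkpoint_path: str) -> None:
    """Write results to a temp file in the same directory, then atomically
    swap it in. If the process dies mid-write, the old checkpoint survives."""
    dir_name = os.path.dirname(checkpoint_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)
        os.replace(tmp_path, checkpoint_path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Writing the temp file into the same directory matters: `os.replace` is only atomic within a single filesystem.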
+ +### Multi-Variant Testing + +Test all 6 problem variants with a single command using the unified CLI: + +```bash +# Test all variants with OpenAI (50 files per variant) +python putnam_cli.py multi-test --provider openai --max-files 50 + +# Test specific variants only +python putnam_cli.py multi-test --provider anthropic --variants original kernel_variant --max-files 25 + +# Test with custom models and high concurrency +python putnam_cli.py multi-test --provider openai --solver-model gpt-4.1-nano --grader-model o3 --max-files 1051 --concurrent 1100 + +# Test with local models +python putnam_cli.py multi-test --provider huggingface --device cuda --max-files 100 + +# Test with VLLM server +python putnam_cli.py multi-test --provider vllm --vllm-url http://localhost:8000/v1 --max-files 50 +``` + +**Available Problem Variants:** +- `original` - Standard mathematical problems +- `descriptive_long` - Clear variable renaming +- `descriptive_long_confusing` - Random unrelated words (marshmallow, armadillo, etc.) +- `descriptive_long_misleading` - Misleading variable names (nonpositiveterm, negativeinitial, etc.) +- `garbled_string` - Completely scrambled variable names +- `kernel_variant` - Simplified core mathematical version + +**Output Structure:** +``` +multi_variant_results/ +├── openai_gpt-4o-mini_20241201_143022/ +│ ├── original_20241201_143022.json +│ ├── descriptive_long_20241201_143022.json +│ ├── ... 
+│ └── SUMMARY_openai_gpt-4o-mini_20241201_143022.json
+└── multi_config_comparison_20241201_143022.json
+```
+
+## Troubleshooting
+
+### CUDA Issues
+```bash
+# Check CUDA compatibility
+python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+
+# If CUDA issues persist, reinstall PyTorch
+pip uninstall torch torchvision torchaudio -y
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+```
+
+### Memory Issues
+```python
+# Clear GPU cache
+import torch
+if torch.cuda.is_available():
+    torch.cuda.empty_cache()
+```
+
+### VLLM Performance Issues
+- **Problem**: VLLM Direct takes 72+ seconds to load
+- **Solution**: Use HuggingFace local or the VLLM server instead
+- **Alternative**: Use cloud providers for better speed
+
+## Testing
+
+```bash
+# Activate environment
+conda activate putnam-local
+
+# Quick health checks
+python -c "
+import asyncio
+from loader import create_loader
+
+async def test():
+    # Test available providers
+    providers = [
+        ('openai', 'gpt-4o-mini'),
+        ('huggingface', 'gpt2'),
+        ('anthropic', 'claude-3-5-haiku'),
+        ('kimi', 'moonshot-v1-8k')
+    ]
+    for provider, model in providers:
+        try:
+            loader = create_loader(provider, solver_model=model)
+            result = await loader.health_check()
+            status = '✅ Ready' if result else '❌ Failed'
+            print(f'{provider}: {status}')
+        except Exception as e:
+            print(f'{provider}: ❌ Error - {e}')
+
+asyncio.run(test())
+"
+```
+
+## Performance Recommendations
+
+### For Speed ⚡
+1. **API Clients**: Sub-2s response (OpenAI, Anthropic, xAI, Gemini)
+2. **HuggingFace**: Fast after first load, completely local
+3. **VLLM Server**: Good for sustained workloads
+
+### For Cost 💰
+1. **HuggingFace**: Free after setup, local inference
+2. **OpenAI GPT-4o-mini**: Cost-effective cloud option
+3. **Gemini Flash**: Good price/performance ratio
+
+### For Privacy 🔒
+1. **HuggingFace**: Complete local inference
+2. 
**VLLM Server**: Local deployment with server architecture + +### For Quality 🎯 +1. **OpenAI o3**: Excellent grading capability +2. **Claude-3.5-Sonnet**: Detailed explanations +3. **Gemini Pro**: Strong mathematical reasoning + +## API Reference + +### Supported Providers +- `openai`: OpenAI GPT models +- `anthropic`: Anthropic Claude models +- `gemini`: Google Gemini models +- `xai`: xAI Grok models +- `kimi`: Kimi/Moonshot models +- `vllm`: VLLM server models +- `vllm_direct`: Direct VLLM API (slow) +- `huggingface`: HuggingFace transformers + +### Factory Function + +```python +create_loader(provider: str, **kwargs) -> ModelLoader +``` + +### Key Methods + +```python +# Test a single problem +await loader.test_single_problem(problem_data, variant_type) + +# Health check +await loader.health_check() + +# Cost estimation +await loader.estimate_cost(num_problems) + +# Model information +loader.get_model_info() +``` + +## Contributing + +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Test with the putnam-local environment +4. Add tests for new functionality +5. Submit a pull request + +## License + +[License information] + +## Citation + +If you use this code in your research, please cite: + +```bibtex +@misc{putnam-bench-anon, + title={Putnam Mathematical Problem Solver}, + year={2024}, + howpublished={\url{<repository-url>}} +} +```
\ No newline at end of file diff --git a/putnam-bench-anon/calibrate_to_o3.py b/putnam-bench-anon/calibrate_to_o3.py new file mode 100644 index 0000000..d2373cc --- /dev/null +++ b/putnam-bench-anon/calibrate_to_o3.py @@ -0,0 +1,335 @@ +#!/usr/bin/env python3 +""" +calibrate_to_o3.py – End-to-end pipeline that +1. ingests existing o4-mini grading results for multiple models, +2. draws a budget-constrained stratified sample, +3. (optionally) re-grades those samples with o3 to obtain gold labels, +4. learns per-stratum error rates and calibrates all o4 labels to the o3 scale, +5. outputs required artefacts: + – sample_list.csv + – o3_raw.parquet (only when --run-o3) + – calibrated_o3_scores.csv + +Run: + python calibrate_to_o3.py # stop after sampling only + python calibrate_to_o3.py --run-o3 # also call o3 re-grader + +""" + +from __future__ import annotations +import argparse +import asyncio +import json +import logging +import math +import random +from pathlib import Path +from typing import Dict, List, Tuple, Any + +import numpy as np +import pandas as pd +from scipy.stats import norm + +# Third-party library used by --run-o3 mode +try: + from loader.openai_client import OpenAIModelLoader # type: ignore +except ModuleNotFoundError: + OpenAIModelLoader = None # graceful degradation when running sampling-only mode + +############################################################################### +# Constants – adjust here if the budget or cost model ever changes +############################################################################### +COST_PER_RECORD = 0.154 # USD per o3 grading request +BUDGET_MAX = 800.0 # USD hard cap +N_MAX = math.floor(BUDGET_MAX / COST_PER_RECORD) # 5194 with default params +SEED = 42 +MIN_PER_LAYER = 10 + +############################# Logging setup ################################### +logging.basicConfig( + level=logging.INFO, + format="%(asctime)s │ %(levelname)-8s │ %(message)s", + datefmt="%H:%M:%S", +) +LOGGER = 
logging.getLogger("calibrate") + +############################################################################### +# Utility functions +############################################################################### + +def wilson_ci(k: float, n: int, conf: float = 0.95) -> Tuple[float, float]: + """Wilson score interval for a proportion. + k may be fractional (calibrated successes). Returns (low, high).""" + if n == 0: + return 0.0, 0.0 + z = norm.ppf(1 - (1 - conf) / 2) + p_hat = k / n + denom = 1 + z ** 2 / n + centre = (p_hat + z ** 2 / (2 * n)) / denom + half_width = ( + z + * math.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * n)) / n) + / denom + ) + return max(0.0, centre - half_width), min(1.0, centre + half_width) + + +def parse_diff(index: str) -> int: + """Extract trailing difficulty digit (1-6) from an index like 2024-B-6.""" + try: + return int(index.split("-")[-1]) + except (ValueError, IndexError): + return -1 # fallback – will be filtered out later + +############################################################################### +# 1 Load meta-data from dataset/*.json – mapping from problem index to (type,diff) +############################################################################### + +def load_dataset_metadata(dataset_dir: Path) -> Dict[str, Tuple[str, int]]: + mapping: Dict[str, Tuple[str, int]] = {} + json_files = sorted(dataset_dir.glob("*.json")) + for fp in json_files: + try: + with fp.open("r", encoding="utf-8") as f: + data = json.load(f) + idx = data.get("index") + typ = data.get("type") + diff = parse_diff(idx) + if idx and typ and diff != -1: + mapping[idx] = (typ, diff) + except Exception as e: + LOGGER.warning(f"Failed to parse {fp}: {e}") + LOGGER.info(f"Loaded metadata for {len(mapping):,} problems from dataset") + return mapping + +############################################################################### +# 2 Load all o4-mini result JSONs into one DataFrame 
+############################################################################### + +def load_o4_results(results_root: Path, meta: Dict[str, Tuple[str, int]]) -> pd.DataFrame: + rows: List[Dict[str, Any]] = [] + model_dirs = [d for d in results_root.iterdir() if d.is_dir()] + for model_dir in model_dirs: + model_id = model_dir.name + # consider only *_original.json for uniformity + for fp in model_dir.glob("*original.json"): + try: + with fp.open("r", encoding="utf-8") as f: + res = json.load(f) + for pr in res.get("problems", []): + idx = pr.get("index") + grade_info = pr.get("grade", {}) + o4_score = int(grade_info.get("grade") == "CORRECT") + # meta info + typ, diff = meta.get(idx, (None, None)) + if typ is None: + continue # skip problems without meta + row = { + "id": idx, + "model_id": model_id, + "type": typ, + "diff": diff, + "o4_score": o4_score, + # Extra fields useful for optional o3 grading + "student_solution": pr.get("solve", {}).get("solution", ""), + } + rows.append(row) + except Exception as e: + LOGGER.warning(f"Failed to process {fp}: {e}") + df = pd.DataFrame(rows) + LOGGER.info(f"Ingested {len(df):,} problem-model pairs across {df['model_id'].nunique()} models") + return df + +############################################################################### +# 3 Stratified sampling under budget +############################################################################### + +def stratified_sample(df: pd.DataFrame) -> pd.DataFrame: + rng = np.random.default_rng(SEED) + group_cols = ["type", "diff", "o4_score"] + + # Compute desired sample sizes per layer + layer_counts = df.groupby(group_cols, observed=True).size().rename("N_k") + total_records = len(df) + target_sizes = ( + (layer_counts / total_records * N_MAX).apply(np.ceil).astype(int).clip(lower=MIN_PER_LAYER) + ) + + # If the initial allocation exceeds budget, scale down proportionally (but keep >=MIN_PER_LAYER) + total_target = target_sizes.sum() + if total_target > N_MAX: + LOGGER.info( + 
f"Initial allocation {total_target} exceeds N_MAX={N_MAX}. Scaling down proportionally." + ) + scaling = (N_MAX - MIN_PER_LAYER * target_sizes.size) / ( + total_target - MIN_PER_LAYER * target_sizes.size + ) + scaling = max(scaling, 0.0) + target_sizes = ( + MIN_PER_LAYER + + np.floor((target_sizes - MIN_PER_LAYER) * scaling).astype(int) + ) + LOGGER.info( + f"Final per-stratum sample sizes prepared (sum={target_sizes.sum()}) – within budget" + ) + + # Actual sampling + samples = [] + for key, group in df.groupby(group_cols, observed=True): + n = min(target_sizes.get(key, MIN_PER_LAYER), len(group)) + if n <= 0: + continue + sample_idx = rng.choice(group.index.to_numpy(), size=n, replace=False) + samples.append(df.loc[sample_idx]) + sample_df = pd.concat(samples, ignore_index=True) + LOGGER.info(f"Sampled {len(sample_df):,} rows in total (<= {N_MAX})") + return sample_df + +############################################################################### +# 4 Async o3 re-grading helper +############################################################################### + +async def grade_with_o3(sample_df: pd.DataFrame, meta: Dict[str, Tuple[str, int]]) -> pd.Series: + """Returns pd.Series of int o3_score aligned with sample_df.index.""" + if OpenAIModelLoader is None: + raise RuntimeError("OpenAIModelLoader not available. 
Install dependencies or run without --run-o3.") + + async with OpenAIModelLoader(solver_model="o3", grader_model="o3") as loader: + + async def grade_one(row) -> int: + idx = row.id + question = None + reference_solution = None + # load dataset file lazily when needed + dataset_file = Path("dataset") / f"{idx}.json" + if dataset_file.exists(): + try: + with dataset_file.open("r", encoding="utf-8") as f: + data = json.load(f) + question = data.get("question", "") + reference_solution = data.get("solution", "") + except Exception: + pass + if not question: + return -1 # cannot grade + student_solution = row.student_solution or "" + try: + grade_result, _ = await loader.grade_solution( + question, + student_solution, + reference_solution, + problem_type="proof", + model="o3", + ) + return int(grade_result.get("grade") == "CORRECT") if grade_result else -1 + except Exception as exc: + LOGGER.warning(f"o3 grading failed for {idx}: {exc}") + return -1 + + sem = asyncio.Semaphore(20) + async def sem_grade(row): + async with sem: + return await grade_one(row) + + tasks = [asyncio.create_task(sem_grade(row)) for _, row in sample_df.iterrows()] + o3_scores = await asyncio.gather(*tasks) + return pd.Series(o3_scores, index=sample_df.index, name="o3_score") + +############################################################################### +# 5 Calibration – compute per-stratum error rates and apply +############################################################################### + +def compute_error_rates(sample_df: pd.DataFrame) -> pd.DataFrame: + group_cols = ["type", "diff"] + + # Build contingency counts per stratum + counts = sample_df.groupby(group_cols + ["o4_score", "o3_score"], observed=True).size().unstack(fill_value=0) + # Ensure o3_score columns 0 and 1 exist + for col in [0, 1]: + if col not in counts.columns: + counts[col] = 0 + # counts index columns: type, diff, o4_score + # Compute p1_k and p0_k + records = [] + for (typ, diff, o4_val), row in 
counts.reset_index().groupby(["type", "diff", "o4_score"], observed=True): + n = row[[0, 1]].sum(axis=1).values[0] + k = row[0].values[0] # for p1 or p0 depends + if o4_val == 1: # looking at false positives (o4=1 but o3=0) + p1 = k / n if n else 0.10 + records.append({"type": typ, "diff": diff, "p1": p1}) + else: # o4=0 + p0 = row[1].values[0] / n if n else 0.10 + records.append({"type": typ, "diff": diff, "p0": p0}) + errs = pd.DataFrame(records).groupby(["type", "diff"], observed=True).first().reset_index() + errs["p1"].fillna(0.10, inplace=True) + errs["p0"].fillna(0.10, inplace=True) + return errs + + +def apply_calibration(full_df: pd.DataFrame, err_df: pd.DataFrame) -> pd.Series: + merged = full_df.merge(err_df, on=["type", "diff"], how="left") + merged["p1"].fillna(0.10, inplace=True) + merged["p0"].fillna(0.10, inplace=True) + est = np.where(merged.o4_score == 1, 1 - merged.p1, merged.p0) + return pd.Series(est, index=full_df.index, name="o3_est") + +############################################################################### +# 6 Main entry +############################################################################### + +def main(): + parser = argparse.ArgumentParser(description="Calibrate o4-mini results to o3 scale") + parser.add_argument("--run-o3", action="store_true", help="Actually call o3 to grade the sampled pairs") + parser.add_argument("--output-dir", default="calibration_out", help="Directory to store generated artefacts") + args = parser.parse_args() + + out_dir = Path(args.output_dir) + out_dir.mkdir(parents=True, exist_ok=True) + + # 1 Load meta and results + meta = load_dataset_metadata(Path("dataset")) + full_df = load_o4_results(Path("results"), meta) + + # 2 Sampling + sample_df = stratified_sample(full_df) + sample_df.to_csv(out_dir / "sample_list.csv", index=False) + + if args.run_o3: + LOGGER.info("Starting o3 re-grading – this may incur cost!") + start = asyncio.run(grade_with_o3(sample_df, meta)) + sample_df["o3_score"] = start 
+ sample_df.to_parquet(out_dir / "o3_raw.parquet", index=False) + spent = sample_df["o3_score"].notna().sum() * COST_PER_RECORD + LOGGER.info(f"o3 grading finished. Cost ≈ ${spent:.2f}") + else: + LOGGER.info("--run-o3 not provided; skipping API calls and downstream calibration") + return # exit early + + # 3 Calibration + err_df = compute_error_rates(sample_df) + full_df["o3_est"] = apply_calibration(full_df, err_df) + + # 4 Aggregate per model + agg_rows = [] + for model_id, grp in full_df.groupby("model_id", observed=True): + mean_est = grp.o3_est.mean() + n = len(grp) + k_hat = mean_est * n + ci_low, ci_high = wilson_ci(k_hat, n) + agg_rows.append({ + "model_id": model_id, + "mean": mean_est, + "ci_low": ci_low, + "ci_high": ci_high, + }) + agg_df = pd.DataFrame(agg_rows) + agg_df.to_csv(out_dir / "calibrated_o3_scores.csv", index=False) + LOGGER.info("Calibration finished. Artefacts saved to %s", out_dir) + + +if __name__ == "__main__": + try: + main() + except Exception as exc: + LOGGER.error("Fatal error: %s", exc) + raise
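As a sanity check on the aggregation step, the Wilson score interval used for the per-model confidence bounds can be reproduced standalone (same formula as `wilson_ci` in the script, with the 95% z-value hard-coded to avoid the SciPy dependency):

```python
import math

def wilson_ci_95(k: float, n: int) -> tuple:
    """Wilson score interval for a proportion at 95% confidence.
    k may be fractional (calibrated successes)."""
    if n == 0:
        return 0.0, 0.0
    z = 1.959963984540054  # norm.ppf(0.975)
    p_hat = k / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * n)) / n) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# 80 calibrated successes out of 100 graded problems
low, high = wilson_ci_95(80, 100)
print(f"({low:.3f}, {high:.3f})")  # (0.711, 0.867)
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for the small strata produced by `MIN_PER_LAYER`.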
\ No newline at end of file diff --git a/putnam-bench-anon/docs/openrouter_usage.md b/putnam-bench-anon/docs/openrouter_usage.md new file mode 100644 index 0000000..85e5712 --- /dev/null +++ b/putnam-bench-anon/docs/openrouter_usage.md @@ -0,0 +1,143 @@ +# OpenRouter Integration Guide + +## Overview + +OpenRouter provides access to multiple AI model providers through a single API endpoint. This integration allows you to use models from OpenAI, Anthropic, Google, Meta, Mistral, and many other providers with a unified interface. + +## Setup + +### 1. Get an API Key + +Sign up at [OpenRouter.ai](https://openrouter.ai) to get your API key. + +### 2. Set Environment Variable + +```bash +export OPENROUTER_API_KEY="your-api-key-here" +``` + +## Usage + +### Basic Usage + +```python +from loader import create_loader + +# Create OpenRouter loader with default models +loader = create_loader("openrouter") + +# Or specify custom models +loader = create_loader( + "openrouter", + solver_model="openai/gpt-4o", + grader_model="anthropic/claude-3-opus" +) +``` + +### Direct Instantiation + +```python +from loader import OpenRouterModelLoader + +loader = OpenRouterModelLoader( + solver_model="openai/gpt-4o-mini", + grader_model="openai/gpt-4o", + api_key="your-api-key", # Optional, uses env var if not provided + site_url="https://yoursite.com", # Optional, for rankings + site_name="Your Site Name" # Optional, for rankings +) +``` + +### Using Cross-Provider Features + +One of the key advantages of OpenRouter is the ability to mix models from different providers: + +```python +# Use OpenAI for solving and Anthropic for grading +loader = create_loader( + "openrouter", + solver_model="openai/gpt-4o", + grader_model="anthropic/claude-3-opus" +) + +# Use Google's Gemini for solving and Meta's Llama for grading +loader = create_loader( + "openrouter", + solver_model="google/gemini-pro", + grader_model="meta-llama/llama-3-70b-instruct" +) +``` + +## Available Models + +OpenRouter supports 
a wide variety of models. Here are some popular options:
+
+### OpenAI Models
+- `openai/gpt-4o` - GPT-4o (omni, multimodal)
+- `openai/gpt-4o-mini` - Smaller, faster GPT-4o
+- `openai/gpt-4-turbo` - GPT-4 Turbo
+- `openai/gpt-3.5-turbo` - GPT-3.5 Turbo
+- `openai/o1-preview` - o1-preview (reasoning model)
+- `openai/o1-mini` - o1-mini
+
+### Anthropic Models
+- `anthropic/claude-3-opus` - Most capable Claude 3 model
+- `anthropic/claude-3-sonnet` - Balanced performance
+- `anthropic/claude-3-haiku` - Fastest Claude 3 model
+- `anthropic/claude-2.1` - Previous generation
+- `anthropic/claude-2` - Previous generation
+
+### Google Models
+- `google/gemini-pro` - Gemini Pro
+- `google/gemini-pro-vision` - Gemini Pro with vision
+- `google/palm-2-codechat-bison` - PaLM 2 for code
+- `google/palm-2-chat-bison` - PaLM 2 for chat
+
+### Meta Models
+- `meta-llama/llama-3-70b-instruct` - Llama 3 70B
+- `meta-llama/llama-3-8b-instruct` - Llama 3 8B
+- `meta-llama/codellama-70b-instruct` - CodeLlama 70B
+
+### Mistral Models
+- `mistralai/mistral-large` - Mistral Large
+- `mistralai/mistral-medium` - Mistral Medium
+- `mistralai/mistral-small` - Mistral Small
+- `mistralai/mixtral-8x7b-instruct` - Mixtral 8x7B (mixture-of-experts)
+
+### Other Models
+- `cohere/command-r-plus` - Cohere Command R+
+- `deepseek/deepseek-coder` - DeepSeek Coder
+- `qwen/qwen-2-72b-instruct` - Qwen 2 72B
+
+For a complete and up-to-date list, visit: https://openrouter.ai/models
+
+## Testing
+
+Run the test script to verify your setup:
+
+```bash
+python test_openrouter.py
+```
+
+## Cost Considerations
+
+Different models have different pricing. Generally:
+- Mini/small models are the cheapest
+- Standard models are moderately priced
+- Large/opus models are the most expensive
+
+Check [OpenRouter pricing](https://openrouter.ai/models) for current rates.
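As a back-of-the-envelope guide, per-request cost follows directly from token counts and the per-million-token prices on the models page. A small sketch (the prices used below are placeholders, not current rates):

```python
def estimate_request_cost(prompt_tokens: int, completion_tokens: int,
                          prompt_usd_per_m: float,
                          completion_usd_per_m: float) -> float:
    """Estimate the USD cost of one request from token counts and
    per-million-token prices, as listed on the OpenRouter models page."""
    return (prompt_tokens * prompt_usd_per_m
            + completion_tokens * completion_usd_per_m) / 1_000_000

# Placeholder prices: $0.15 input / $0.60 output per million tokens.
print(f"${estimate_request_cost(1_000, 2_000, 0.15, 0.60):.6f}")  # $0.001350
```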
+ +## Troubleshooting + +### API Key Issues +- Ensure `OPENROUTER_API_KEY` is set correctly +- Check that your API key has sufficient credits + +### Model Availability +- Some models may have limited availability +- Check OpenRouter status page if a model isn't responding + +### Rate Limits +- OpenRouter has rate limits that vary by model +- The loader includes automatic retry logic for rate limit errors
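The retry logic mentioned above is essentially an exponential backoff loop. A sketch of the pattern (not the loader's actual implementation; `RateLimitError` stands in for whatever exception the underlying client raises):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception."""

async def with_backoff(call, retries: int = 5, base_delay: float = 1.0):
    """Retry an async call on rate-limit errors, doubling the delay each time."""
    for attempt in range(retries):
        try:
            return await call()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            # base, 2*base, 4*base, ... plus jitter to avoid thundering herds
            await asyncio.sleep(base_delay * 2 ** attempt
                                + random.uniform(0, base_delay))
```

The jitter term keeps many concurrent workers from retrying in lockstep after a shared rate-limit event.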
\ No newline at end of file diff --git a/putnam-bench-anon/examples/openrouter_example.py b/putnam-bench-anon/examples/openrouter_example.py new file mode 100644 index 0000000..bc75877 --- /dev/null +++ b/putnam-bench-anon/examples/openrouter_example.py @@ -0,0 +1,157 @@ +#!/usr/bin/env python3 +""" +Example of using OpenRouter with putnam-bench to solve mathematical problems. + +This example demonstrates: +1. Using different model combinations from different providers +2. Solving a real problem from the dataset +3. Comparing results across different models +""" + +import asyncio +import json +import os +from loader import create_loader + +async def solve_with_openrouter(): + """Example of solving a Putnam problem using OpenRouter.""" + + # Check API key + if not os.getenv('OPENROUTER_API_KEY'): + print("❌ Please set OPENROUTER_API_KEY environment variable") + return + + # Load a sample problem + problem_file = "dataset/1938-A-1.json" + if not os.path.exists(problem_file): + print(f"❌ Problem file not found: {problem_file}") + print(" Make sure you're running from the project root directory") + return + + with open(problem_file) as f: + problem_data = json.load(f) + + print(f"📚 Problem: {problem_data['problem_statement'][:100]}...") + print(f" Type: {problem_data['problem_type']}") + print(f" Year: {problem_data['year']}") + + # Test with different model combinations + test_configs = [ + { + "name": "OpenAI Only", + "solver": "openai/gpt-4o-mini", + "grader": "openai/gpt-4o" + }, + { + "name": "Mixed OpenAI/Anthropic", + "solver": "openai/gpt-4o", + "grader": "anthropic/claude-3-haiku" + }, + { + "name": "Google Gemini", + "solver": "google/gemini-pro", + "grader": "google/gemini-pro" + } + ] + + for config in test_configs: + print(f"\n{'='*60}") + print(f"🧪 Testing: {config['name']}") + print(f" Solver: {config['solver']}") + print(f" Grader: {config['grader']}") + + try: + # Create loader with specific models + loader = create_loader( + "openrouter", + 
solver_model=config['solver'], + grader_model=config['grader'], + retries=3, + timeout_base=120 + ) + + # Solve the problem + print("\n⏳ Solving problem...") + solution, raw = await loader.solve_problem(problem_data['problem_statement']) + + if solution: + print("✅ Solution found!") + print(f" Final answer: {solution.get('final_answer', 'N/A')}") + + # Grade the solution (if it's a proof problem) + if problem_data['problem_type'] == 'proof': + print("\n⏳ Grading solution...") + grade_result = await loader.grade_solution( + problem_data['problem_statement'], + solution['solution'], + problem_data.get('ground_truth_solution', ''), + problem_type='proof' + ) + + if grade_result: + print(f"📊 Grade: {grade_result.get('score', 'N/A')}/10") + print(f" Reasoning: {grade_result.get('reasoning', 'N/A')[:100]}...") + else: + print(" (Calculation problem - grading skipped)") + else: + print("❌ Failed to get solution") + + except Exception as e: + print(f"❌ Error: {type(e).__name__}: {e}") + + print(f"\n{'='*60}") + print("✅ Example completed!") + +async def list_recommended_models(): + """List recommended model combinations for different use cases.""" + + print("\n📋 Recommended OpenRouter Model Combinations:\n") + + recommendations = [ + { + "use_case": "Best Quality (Expensive)", + "solver": "openai/gpt-4o", + "grader": "anthropic/claude-3-opus", + "notes": "Highest accuracy but most expensive" + }, + { + "use_case": "Balanced Performance", + "solver": "openai/gpt-4o-mini", + "grader": "anthropic/claude-3-sonnet", + "notes": "Good balance of cost and performance" + }, + { + "use_case": "Budget Friendly", + "solver": "openai/gpt-3.5-turbo", + "grader": "google/gemini-pro", + "notes": "Cheapest option, still decent quality" + }, + { + "use_case": "Open Source Models", + "solver": "meta-llama/llama-3-70b-instruct", + "grader": "mistralai/mixtral-8x7b-instruct", + "notes": "Using open-source models only" + }, + { + "use_case": "Code-Focused", + "solver": 
"deepseek/deepseek-coder", + "grader": "meta-llama/codellama-70b-instruct", + "notes": "Optimized for problems with code" + } + ] + + for rec in recommendations: + print(f"🎯 {rec['use_case']}") + print(f" Solver: {rec['solver']}") + print(f" Grader: {rec['grader']}") + print(f" Notes: {rec['notes']}") + print() + +if __name__ == "__main__": + print("🚀 OpenRouter Example for Putnam Bench") + + # Run the example + asyncio.run(solve_with_openrouter()) + + # Show recommendations + asyncio.run(list_recommended_models())
\ No newline at end of file diff --git a/putnam-bench-anon/install.py b/putnam-bench-anon/install.py new file mode 100644 index 0000000..36812b9 --- /dev/null +++ b/putnam-bench-anon/install.py @@ -0,0 +1,240 @@ +#!/usr/bin/env python3 +""" +Quick installation and setup script for Putnam Problem Solver. + +This script provides a one-command setup for the entire system. + +Usage: + python install.py # Full installation + python install.py --quick # Quick setup (minimal dependencies) + python install.py --check-only # Just check what would be installed +""" + +import asyncio +import sys +import subprocess +import os +from pathlib import Path +import argparse + + +def print_banner(): + """Print installation banner.""" + print("🚀 Putnam Problem Solver - Quick Install") + print("=" * 50) + print("This script will set up everything you need to get started.") + print() + + +def check_python(): + """Check Python version.""" + version = sys.version_info + if version.major < 3 or (version.major == 3 and version.minor < 8): + print("❌ Python 3.8+ required") + print(f" Current version: {version.major}.{version.minor}.{version.micro}") + return False + + print(f"✅ Python {version.major}.{version.minor}.{version.micro}") + return True + + +def install_packages(packages: list, check_only: bool = False): + """Install required packages.""" + if not packages: + print("✅ All required packages are already installed!") + return True + + print(f"📦 {'Would install' if check_only else 'Installing'}: {', '.join(packages)}") + + if check_only: + return True + + try: + subprocess.run([ + sys.executable, '-m', 'pip', 'install', '--upgrade' + ] + packages, check=True) + print("✅ Packages installed successfully!") + return True + except subprocess.CalledProcessError as e: + print(f"❌ Failed to install packages: {e}") + return False + + +def check_package_installed(package: str) -> bool: + """Check if a package is installed.""" + try: + if package == 'google-generativeai': + import 
google.generativeai + else: + __import__(package) + return True + except ImportError: + return False + + +def get_missing_packages(include_optional: bool = True) -> list: + """Get list of missing packages.""" + # Core packages (always required) + core_packages = ['openai', 'anthropic', 'google-generativeai', 'psutil'] + + # Optional packages (for local models) + optional_packages = ['transformers', 'torch', 'vllm'] if include_optional else [] + + missing = [] + + print("🔍 Checking required packages...") + for package in core_packages: + if check_package_installed(package): + print(f" ✅ {package}") + else: + print(f" ❌ {package}") + missing.append(package) + + if include_optional: + print("\n🔍 Checking optional packages (for local models)...") + for package in optional_packages: + if check_package_installed(package): + print(f" ✅ {package}") + else: + print(f" ⚠️ {package} (optional)") + missing.append(package) + + return missing + + +def create_alias(): + """Create putnam command alias.""" + script_dir = Path(__file__).parent.absolute() + putnam_cli = script_dir / "putnam_cli.py" + + # For Unix-like systems + if os.name != 'nt': + shell_profile = Path.home() / ".bashrc" + if not shell_profile.exists(): + shell_profile = Path.home() / ".bash_profile" + if not shell_profile.exists(): + shell_profile = Path.home() / ".zshrc" + + alias_line = f'alias putnam="python {putnam_cli}"' + + try: + # Check if alias already exists + if shell_profile.exists(): + with open(shell_profile, 'r') as f: + content = f.read() + if 'alias putnam=' in content: + print("✅ 'putnam' alias already exists") + return True + + # Add alias + with open(shell_profile, 'a') as f: + f.write(f"\n# Putnam Problem Solver alias\n{alias_line}\n") + + print(f"✅ Added 'putnam' alias to {shell_profile}") + print(f"💡 Restart your shell or run: source {shell_profile}") + return True + + except Exception as e: + print(f"⚠️ Could not create alias: {e}") + print(f"💡 You can manually add: {alias_line}") + + else: 
+ # Windows + print("💡 On Windows, use: python putnam_cli.py <command>") + + return False + + +async def run_setup(): + """Run the configuration setup.""" + print("\n🛠️ Running configuration setup...") + try: + from setup_config import ConfigManager + manager = ConfigManager() + await manager.interactive_setup() + return True + except ImportError: + print("❌ Configuration setup not available") + return False + except Exception as e: + print(f"❌ Setup failed: {e}") + return False + + +def print_next_steps(): + """Print next steps for the user.""" + print("\n🎉 Installation completed!") + print("\n📚 Next Steps:") + print(" 1. Set up API keys: python setup_config.py") + print(" 2. Check health: python health_check.py") + print(" 3. Quick test: python putnam_cli.py test --provider openai") + print(" 4. Solve a problem: python putnam_cli.py solve dataset/1938-A-1.json") + print(" 5. Run benchmark: python benchmark.py --quick-test") + print("\n💡 Available Scripts:") + print(" • putnam_cli.py - Main CLI interface") + print(" • health_check.py - Check provider health") + print(" • batch_evaluate.py - Batch evaluation") + print(" • benchmark.py - Performance comparison") + print(" • setup_config.py - Configuration management") + print(" • local_models_example.py - Local model examples") + print("\n📖 Documentation: See README.md for detailed usage") + + +async def main(): + """Main installation function.""" + parser = argparse.ArgumentParser(description="Install Putnam Problem Solver") + parser.add_argument("--quick", action="store_true", + help="Quick install (core packages only)") + parser.add_argument("--check-only", action="store_true", + help="Check what would be installed without installing") + parser.add_argument("--no-setup", action="store_true", + help="Skip interactive configuration setup") + parser.add_argument("--no-alias", action="store_true", + help="Skip creating command alias") + + args = parser.parse_args() + + print_banner() + + # Check Python version + if 
not check_python(): + return 1 + + # Check packages + missing_packages = get_missing_packages(include_optional=not args.quick) + + if args.check_only: + if missing_packages: + print(f"\n📋 Would install: {', '.join(missing_packages)}") + else: + print("\n✅ All packages are already installed!") + return 0 + + # Install packages + if missing_packages: + print(f"\n📦 Installing {len(missing_packages)} packages...") + if not install_packages(missing_packages): + return 1 + else: + print("\n✅ All packages are already installed!") + + # Create alias + if not args.no_alias: + print("\n🔗 Creating command alias...") + create_alias() + + # Run configuration setup + if not args.no_setup: + if input("\n🛠️ Run configuration setup now? (y/n): ").lower().startswith('y'): + await run_setup() + else: + print("💡 You can run setup later with: python setup_config.py") + + # Print next steps + print_next_steps() + + return 0 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
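The package check driving `get_missing_packages` maps pip distribution names to import names, since some distributions (e.g. `google-generativeai`, imported as `google.generativeai`) are installed under a different name than they are imported. A minimal standalone sketch of that idea, with an illustrative mapping table (the real script's table may contain more entries):

```python
import importlib

# Pip distribution name -> importable module name, for packages whose
# two names differ. Illustrative; extend as needed.
IMPORT_NAMES = {
    "google-generativeai": "google.generativeai",
}

def check_package_installed(package: str) -> bool:
    """Return True if the package's module can be imported."""
    module = IMPORT_NAMES.get(package, package)
    try:
        importlib.import_module(module)
        return True
    except ImportError:
        return False

print(check_package_installed("os"))                 # True (stdlib module)
print(check_package_installed("no-such-package-xyz"))  # False
```

Using `importlib.import_module` rather than bare `__import__` handles dotted module paths like `google.generativeai` without extra attribute traversal.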
\ No newline at end of file diff --git a/putnam-bench-anon/loader/__init__.py b/putnam-bench-anon/loader/__init__.py new file mode 100644 index 0000000..f6e9cf9 --- /dev/null +++ b/putnam-bench-anon/loader/__init__.py @@ -0,0 +1,290 @@ +""" +Model loader package for mathematical problem solving. + +This package provides a unified interface for loading and using different AI models +to solve mathematical problems and grade solutions. + +Usage: + # Create an OpenAI loader + loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="o3") + + # Create an OpenRouter loader + loader = create_loader("openrouter", solver_model="openai/gpt-4o", grader_model="anthropic/claude-3-opus") + + # Or directly instantiate + from loader import OpenAIModelLoader + loader = OpenAIModelLoader() + + # Test a problem + import json + with open("dataset/1938-A-1.json") as f: + data = json.load(f) + + result = await loader.test_single_problem(data, variant_type="original") +""" + +from .base import ModelLoader +from .openai_client import OpenAIModelLoader, KimiModelLoader +from .anthropic_client import AnthropicModelLoader +from .gemini_client import GeminiModelLoader +from .xai_client import XAIModelLoader +from .openrouter_client import OpenRouterModelLoader +from .vllm_local import VLLMModelLoader +from .vllm_direct import VLLMDirectModelLoader +from .hf_local import HuggingFaceModelLoader +from .cross_provider import CrossProviderLoader +from .prompts import ( + SOLVER_SYSTEM_PROMPT, + SOLVER_USER_TEMPLATE, + PROOF_GRADER_SYSTEM_PROMPT, + CALCULATION_GRADER_SYSTEM_PROMPT, + PROOF_GRADER_USER_TEMPLATE, + CALCULATION_GRADER_USER_TEMPLATE, + RESPONSE_FORMAT, + DEFAULT_RETRIES, + DEFAULT_TIMEOUT_BASE +) +from typing import Optional + +def create_loader(provider: str, **kwargs) -> ModelLoader: + """ + Factory function to create model loaders. + + Args: + provider: Provider name ("openai", "anthropic", "gemini", etc.) 
+ **kwargs: Additional arguments passed to the loader constructor + + Returns: + ModelLoader instance + + Raises: + ValueError: If provider is not supported + + Examples: + # Create OpenAI loader + loader = create_loader("openai", solver_model="gpt-4o-mini", grader_model="o3") + + # Create loader with custom settings + loader = create_loader( + "openai", + solver_model="gpt-4", + grader_model="o3", + retries=5, + timeout_base=600 + ) + """ + provider_lower = provider.lower() + + if provider_lower == "openai": + return OpenAIModelLoader(**kwargs) + elif provider_lower == "anthropic": + return AnthropicModelLoader(**kwargs) + elif provider_lower == "gemini": + return GeminiModelLoader(**kwargs) + elif provider_lower == "xai": + return XAIModelLoader(**kwargs) + elif provider_lower == "openrouter": + return OpenRouterModelLoader(**kwargs) + elif provider_lower == "kimi": + return KimiModelLoader(**kwargs) + elif provider_lower == "vllm": + return VLLMModelLoader(**kwargs) + elif provider_lower == "vllm_direct": + return VLLMDirectModelLoader(**kwargs) + elif provider_lower in ["huggingface", "hf"]: + return HuggingFaceModelLoader(**kwargs) + else: + supported = ["openai", "anthropic", "gemini", "xai", "openrouter", "kimi", "vllm", "vllm_direct", "huggingface"] + raise ValueError(f"Unsupported provider: {provider}. Supported providers: {supported}") + + +def create_cross_provider_loader( + solver_provider: str, + grader_provider: Optional[str] = None, + solver_model: Optional[str] = None, + grader_model: Optional[str] = None, + **kwargs +) -> ModelLoader: + """ + Create a loader that can use different providers for solving and grading. 
+ + Args: + solver_provider: Provider for solving problems + grader_provider: Provider for grading (if None, uses solver_provider) + solver_model: Override solver model + grader_model: Override grader model + **kwargs: Additional arguments (can include provider-specific settings) + + Returns: + CrossProviderLoader instance + + Examples: + # Use Kimi for solving and OpenAI for grading + loader = create_cross_provider_loader( + solver_provider="kimi", + grader_provider="openai", + solver_model="Kimi-K2-Instruct", + grader_model="o3" + ) + + # Use same provider but different models + loader = create_cross_provider_loader( + solver_provider="openai", + solver_model="gpt-4o-mini", + grader_model="o3" + ) + """ + # Extract provider-specific kwargs + solver_kwargs = kwargs.pop('solver_kwargs', {}) + grader_kwargs = kwargs.pop('grader_kwargs', {}) + + # Extract common parameters that should be passed to both loaders + quick = kwargs.pop('quick', False) + debug = kwargs.pop('debug', False) + + # Add common parameters to both solver and grader kwargs + solver_kwargs.update({'quick': quick, 'debug': debug}) + grader_kwargs.update({'quick': quick, 'debug': debug}) + + # Get default models if not specified + if not solver_model: + solver_defaults = get_default_models(solver_provider) + solver_model = solver_defaults['solver_model'] + + if not grader_provider: + grader_provider = solver_provider + + if not grader_model: + grader_defaults = get_default_models(grader_provider) + grader_model = grader_defaults['grader_model'] + + # Create solver loader + solver_loader = create_loader( + solver_provider, + solver_model=solver_model, + grader_model=solver_model, # Use solver model for both in solver loader + **solver_kwargs + ) + + # Create grader loader if different provider + if grader_provider != solver_provider: + grader_loader = create_loader( + grader_provider, + solver_model=grader_model, # Use grader model for both in grader loader + grader_model=grader_model, + 
**grader_kwargs + ) + return CrossProviderLoader(solver_loader, grader_loader, **kwargs) + else: + # Same provider, but possibly different models + if solver_model != grader_model: + # Need to create a single loader with both models + single_loader = create_loader( + solver_provider, + solver_model=solver_model, + grader_model=grader_model, + **solver_kwargs + ) + return single_loader + else: + # Same provider and model + return solver_loader + +def get_supported_providers() -> list[str]: + """ + Get list of supported model providers. + + Returns: + List of supported provider names + """ + return ["openai", "anthropic", "gemini", "xai", "openrouter", "kimi", "vllm", "vllm_direct", "huggingface"] + +def get_default_models(provider: str) -> dict[str, str]: + """ + Get default model names for a provider. + + Args: + provider: Provider name + + Returns: + Dictionary with default solver_model and grader_model + """ + defaults = { + "openai": { + "solver_model": "gpt-4o-mini", + "grader_model": "o3" + }, + "anthropic": { + "solver_model": "claude-3-5-haiku-20241022", + "grader_model": "claude-3-5-sonnet-20241022" + }, + "gemini": { + "solver_model": "gemini-1.5-flash", + "grader_model": "gemini-1.5-pro" + }, + "xai": { + "solver_model": "grok-3", + "grader_model": "grok-3" + }, + "openrouter": { + "solver_model": "openai/gpt-4o", + "grader_model": "openai/gpt-4o" + }, + "kimi": { + "solver_model": "moonshot-v1-8k", + "grader_model": "moonshot-v1-8k" + }, + "vllm": { + "solver_model": "meta-llama/Llama-3.2-3B-Instruct", + "grader_model": "meta-llama/Llama-3.2-8B-Instruct" + }, + "vllm_direct": { + "solver_model": "gpt2", + "grader_model": "gpt2" + }, + "huggingface": { + "solver_model": "microsoft/DialoGPT-medium", + "grader_model": "microsoft/DialoGPT-large" + } + } + + provider_lower = provider.lower() + if provider_lower not in defaults: + raise ValueError(f"No defaults available for provider: {provider}") + + return defaults[provider_lower] + +# Export main classes 
and functions +__all__ = [ + # Main classes + "ModelLoader", + "OpenAIModelLoader", + "AnthropicModelLoader", + "GeminiModelLoader", + "XAIModelLoader", + "OpenRouterModelLoader", + "KimiModelLoader", + "VLLMModelLoader", + "VLLMDirectModelLoader", + "HuggingFaceModelLoader", + "CrossProviderLoader", + + # Factory functions + "create_loader", + "create_cross_provider_loader", + "get_supported_providers", + "get_default_models", + + # Prompts (for advanced users) + "SOLVER_SYSTEM_PROMPT", + "SOLVER_USER_TEMPLATE", + "PROOF_GRADER_SYSTEM_PROMPT", + "CALCULATION_GRADER_SYSTEM_PROMPT", + "PROOF_GRADER_USER_TEMPLATE", + "CALCULATION_GRADER_USER_TEMPLATE", + + # Configuration constants + "RESPONSE_FORMAT", + "DEFAULT_RETRIES", + "DEFAULT_TIMEOUT_BASE" +] diff --git a/putnam-bench-anon/loader/anthropic_client.py b/putnam-bench-anon/loader/anthropic_client.py new file mode 100644 index 0000000..e81f220 --- /dev/null +++ b/putnam-bench-anon/loader/anthropic_client.py @@ -0,0 +1,227 @@ +""" +Anthropic model loader implementation. +Handles API calls to Anthropic Claude models with proper error handling and retry logic. +""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional + +try: + from anthropic import AsyncAnthropic, RateLimitError, APIError, APIConnectionError +except ImportError: + AsyncAnthropic = None + RateLimitError = Exception + APIError = Exception + APIConnectionError = Exception + +from .base import ModelLoader +from .prompts import RESPONSE_FORMAT + + +class AnthropicModelLoader(ModelLoader): + """Anthropic implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "claude-3-5-haiku-20241022", + grader_model: str = "claude-3-5-sonnet-20241022", + api_key: Optional[str] = None, + base_url: Optional[str] = None, + **kwargs): + """ + Initialize Anthropic model loader. 
+ + Args: + solver_model: Anthropic model for solving problems (default: claude-3-5-haiku) + grader_model: Anthropic model for grading solutions (default: claude-3-5-sonnet) + api_key: Anthropic API key (if None, uses environment variable) + base_url: Custom base URL for Anthropic API + **kwargs: Additional arguments passed to parent class + """ + if AsyncAnthropic is None: + raise ImportError( + "anthropic package is required for AnthropicModelLoader. " + "Install with: pip install anthropic" + ) + + super().__init__(solver_model, grader_model, **kwargs) + + # Initialize Anthropic client + client_kwargs = {} + if api_key: + client_kwargs["api_key"] = api_key + if base_url: + client_kwargs["base_url"] = base_url + + self.client = AsyncAnthropic(**client_kwargs) + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to Anthropic. + + Args: + model: Anthropic model name + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Convert OpenAI format to Anthropic format + system_message = None + user_messages = [] + + for msg in messages: + if msg["role"] == "system": + system_message = msg["content"] + else: + user_messages.append(msg) + + # Prepare API call parameters + api_params = { + "model": model, + "messages": user_messages, + "max_tokens": 4000, # Anthropic requires max_tokens + "temperature": temperature, + } + + if system_message: + api_params["system"] = system_message + + # Make the API call + response = await self.client.messages.create(**api_params) + + # Extract response content + content = "" + if response.content: + for block in response.content: + if hasattr(block, 'text'): + content += block.text + + return content, content + + except RateLimitError as e: + # Handle rate limiting with special logic + error_str = str(e) + print(f"🚫 
RateLimitError: {error_str}") + + if "insufficient_quota" in error_str.lower(): + print("⏳ Detected quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) # 15 minutes + else: + # Standard rate limit - shorter sleep + sleep_time = 2 + random.random() + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except (APIError, APIConnectionError) as e: + print(f"❌ Anthropic API Error: {str(e)}") + raise + + except Exception as e: + print(f"❌ Unexpected error in Anthropic API call: {str(e)}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "anthropic" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify API connectivity. + + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.0 + ) + + if result and "ok" in result.lower(): + print(f"✅ Anthropic API health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ Anthropic API health check returned unexpected response") + return False + + except Exception as e: + print(f"❌ Anthropic API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Anthropic pricing (update with actual Anthropic pricing) + # These are rough estimates and should be updated with current pricing + pricing = { + "claude-3-5-haiku-20241022": {"input": 0.0008, "output": 0.004}, # per 1K tokens + "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015}, # per 1K tokens + "claude-3-opus-20240229": {"input": 0.015, "output": 0.075}, # per 1K tokens + "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125}, # per 1K tokens + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "claude-3-5-sonnet-20241022" # Default fallback + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + self.solver_model, + tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 3 # Assume output is ~33% of input + ) + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "USD" + } diff --git a/putnam-bench-anon/loader/base.py b/putnam-bench-anon/loader/base.py new file mode 100644 index 0000000..5e24a8f --- 
/dev/null +++ b/putnam-bench-anon/loader/base.py @@ -0,0 +1,514 @@ +""" +Abstract base class for model loaders. +Defines the interface for mathematical problem solving and grading. +""" + +import re +import json +import asyncio +import random +from abc import ABC, abstractmethod +from typing import Dict, List, Tuple, Optional, Any + +from .prompts import ( + SOLVER_SYSTEM_PROMPT, + SOLVER_USER_TEMPLATE, + PROOF_GRADER_SYSTEM_PROMPT, + CALCULATION_GRADER_SYSTEM_PROMPT, + PROOF_GRADER_USER_TEMPLATE, + CALCULATION_GRADER_USER_TEMPLATE, + RESPONSE_FORMAT, + DEFAULT_RETRIES, + DEFAULT_TIMEOUT_BASE +) + +# JSON extraction regex +JSON_RE = re.compile(r"\{[\s\S]*\}") + + +class ModelLoader(ABC): + """Abstract base class for model loaders.""" + + def __init__(self, + solver_model: str, + grader_model: str, + retries: int = DEFAULT_RETRIES, + timeout_base: int = DEFAULT_TIMEOUT_BASE, + debug: bool = False, + quick: bool = False): + """ + Initialize the model loader. + + Args: + solver_model: Model name for solving problems + grader_model: Model name for grading solutions + retries: Number of retry attempts for API calls + timeout_base: Base timeout in seconds for API calls + debug: Enable debug logging for JSON parsing + quick: Quick mode - allows one retry with 1200s timeout each attempt + """ + self.solver_model = solver_model + self.grader_model = grader_model + self.retries = retries + self.timeout_base = timeout_base + self.debug = debug + self.quick = quick + + # Override settings for quick mode + if self.quick: + self.retries = 1 # Only try once + self.timeout_base = 1200 # 20 minutes timeout + + @abstractmethod + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to the model. 
+ + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (parsed_response, raw_response) + """ + pass + + def parse_json_response(self, raw: str, debug: bool = False) -> Optional[Dict]: + """Parse JSON from LLM response with fallback strategies.""" + if not raw: + return None + + # Try direct JSON parse + try: + return json.loads(raw) + except Exception as e: + if debug: + print(f"⚠️ Direct JSON parse failed: {e}") + + # Try to find JSON in the response + match = JSON_RE.search(raw) + if match: + try: + return json.loads(match.group(0)) + except Exception as e: + if debug: + print(f"⚠️ Regex JSON parse failed: {e}") + + # Try fixing common JSON issues including control characters + try: + # Fix escaped quotes and backslashes + fixed = raw.replace('\\"', '"').replace('\\\\', '\\') + + # Fix unescaped newlines and other control characters in JSON strings + # This is a more robust approach for LLM responses + import ast + + # Try to use ast.literal_eval if it's a simple dict-like structure + if fixed.strip().startswith('{') and fixed.strip().endswith('}'): + try: + # Replace common problematic patterns + fixed = fixed.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') + return json.loads(fixed) + except Exception as e: + if debug: + print(f"⚠️ Fixed JSON parse failed: {e}") + + except Exception as e: + if debug: + print(f"⚠️ JSON fixing failed: {e}") + + # ENHANCED: Try to complete truncated JSON + try: + if raw.strip().startswith('{') and not raw.strip().endswith('}'): + if debug: + print("🔧 Attempting to fix truncated JSON...") + + # Try to find the last complete key-value pair + # Look for solution content + if '"solution"' in raw: + # Extract solution up to the truncation point + solution_start = raw.find('"solution"') + solution_content = raw[solution_start:] + + # Find the actual solution text + import re + solution_match = 
re.search(r'"solution":\s*"([^"]*(?:\\"[^"]*)*)', raw, re.DOTALL) + if solution_match: + solution_text = solution_match.group(1) + # Clean up the solution text + solution_text = solution_text.replace('\\"', '"').replace('\\n', '\n') + + if debug: + print(f"🔧 Extracted solution from truncated JSON ({len(solution_text)} chars)") + return { + "solution": solution_text, + "final_answer": "Solution was truncated - see solution field for complete answer" + } + except Exception as e: + if debug: + print(f"⚠️ Truncated JSON recovery failed: {e}") + + # Final fallback: try to extract key-value pairs manually + try: + if '"solution"' in raw: + import re + + if debug: + print("🔧 Attempting manual key-value extraction...") + + # Extract solution (more robust pattern) + solution_match = re.search(r'"solution":\s*"([^"]*(?:\\"[^"]*)*)', raw, re.DOTALL) + solution = solution_match.group(1) if solution_match else "" + + # Extract final_answer if it exists + answer_match = re.search(r'"final_answer":\s*"([^"]*)"', raw) + final_answer = answer_match.group(1) if answer_match else "" + + if solution: + # Clean up the solution text + solution = solution.replace('\\"', '"').replace('\\n', '\n') + + if debug: + print(f"🔧 Manual extraction successful ({len(solution)} chars solution)") + return { + "solution": solution, + "final_answer": final_answer if final_answer else "See solution field for complete answer" + } + except Exception as e: + if debug: + print(f"⚠️ Manual extraction failed: {e}") + + if debug: + print("❌ All JSON parsing strategies failed") + return None + + def to_str(self, x) -> str: + """Convert various types to string safely.""" + if x is None: + return "" + if isinstance(x, str): + return x + if isinstance(x, (list, tuple)): + return "\n".join(map(str, x)) + return str(x) + + async def call_api_with_retry(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[Dict], str]: + """ + Make API call with retry logic and JSON 
parsing. + + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (parsed_json_response, raw_response) + """ + raw_response = "" + + # In quick mode, we allow one retry with a fixed timeout + if self.quick: + max_attempts = 2 # Allow one retry in quick mode + if self.debug: + print(f"⚡ Quick mode: Up to {max_attempts} attempts with {self.timeout_base}s timeout each") + + for attempt in range(1, max_attempts + 1): + try: + if attempt > 1 and self.debug: + print(f"🔄 Quick mode retry attempt {attempt}/{max_attempts}") + + parsed, raw_response = await asyncio.wait_for( + self._call_api(model, messages, temperature), + timeout=self.timeout_base + ) + + if parsed: + # Try to parse as JSON + debug_mode = getattr(self, 'debug', False) + json_parsed = self.parse_json_response(parsed, debug=debug_mode) + if json_parsed: + return json_parsed, raw_response + return None, raw_response + else: + raise ValueError("Empty response from API") + + except Exception as e: + error_type = type(e).__name__ + error_msg = str(e) + print(f"❌ {error_type} in quick mode (attempt {attempt}/{max_attempts}): {error_msg}") + + # If this was the last attempt, mark as failed + if attempt == max_attempts: + return {"_max_retries_reached": True, "error": str(e)}, raw_response + + # Otherwise, wait a bit before retrying + if self.debug: + print("⏳ Waiting 5 seconds before retry...") + await asyncio.sleep(5) + + # Regular mode with retries + for attempt in range(1, self.retries + 1): + # More aggressive timeout scaling for persistent failures + # Cap timeout at 30 minutes to prevent extremely long waits + timeout = min(self.timeout_base * (1.5 ** (attempt - 1)), 1800) + if self.debug: + print(f"🔄 Attempt {attempt}/{self.retries} with timeout {timeout:.0f}s") + try: + parsed, raw_response = await asyncio.wait_for( + self._call_api(model, messages, temperature), + timeout=timeout + ) + + if parsed: + # Try to parse 
as JSON + debug_mode = getattr(self, 'debug', False) + json_parsed = self.parse_json_response(parsed, debug=debug_mode) + if json_parsed: + return json_parsed, raw_response + return None, raw_response + else: + raise ValueError("Empty response from API") + + except Exception as e: + error_type = type(e).__name__ + error_msg = str(e) + + # Only show detailed error info on first attempt or in debug mode + if attempt == 1 or self.debug: + print(f"❌ {error_type} (attempt {attempt}/{self.retries}): {error_msg}") + + if attempt == self.retries: + print(f"🔥 All {self.retries} retry attempts exhausted for {error_type}") + # Return a special marker for max retries reached + return {"_max_retries_reached": True, "error": str(e)}, raw_response + + # Custom retry strategy: 600s -> 900s -> 900s -> 1200s... + if attempt == 1: + base_sleep = 600 # 10 minutes + elif attempt == 2 or attempt == 3: + base_sleep = 900 # 15 minutes + else: + base_sleep = 1200 # 20 minutes + + # Add small random jitter to avoid synchronized retries + jitter = random.uniform(0, 30) # 0-30 seconds jitter + sleep_time = base_sleep + jitter + + if self.debug: + print(f" ⏰ Using custom backoff strategy: {base_sleep}s base + {jitter:.1f}s jitter") + + if self.debug: + print(f" ⏰ Retrying in {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + return None, raw_response + + async def solve_problem(self, problem_statement: str, model: Optional[str] = None) -> Tuple[Optional[Dict], str]: + """ + Have model solve mathematical problems. 
+ + Args: + problem_statement: Problem statement + model: Model name to use for solving (if None, uses default solver_model) + + Returns: + Tuple of (solving result dictionary, raw response) + Solving result contains: {"solution": "detailed solution", "final_answer": "final answer"} + """ + messages = [ + {"role": "system", "content": SOLVER_SYSTEM_PROMPT}, + {"role": "user", "content": SOLVER_USER_TEMPLATE.format( + problem_statement=problem_statement + )} + ] + + # Use specified model or default solver model + solver_model = model if model is not None else self.solver_model + + # Set temperature based on model + # o3, o3-mini, and o4-mini require temperature 1.0 + if any(model_name in solver_model.lower() for model_name in ['o3', 'o3-mini', 'o4-mini']): + temperature = 1.0 + else: + # Use temperature 0.0 for deterministic solving with other models + temperature = 0.0 + + return await self.call_api_with_retry(solver_model, messages, temperature=temperature) + + async def grade_solution(self, + problem_statement: str, + solution: str, + reference_solution: str, + problem_type: str = "proof", + model: Optional[str] = None) -> Tuple[Optional[Dict], str]: + """ + Have model grade solution based on problem type. 
+ + Args: + problem_statement: Problem statement + solution: Student solution + reference_solution: Reference solution + problem_type: Problem type ("proof" strict grading, "calculation" lenient grading) + model: Model name to use for grading (if None, uses default grader_model) + + Returns: + Tuple of (grading result dictionary, raw response) + Grading result contains: {"grade": "CORRECT"/"INCORRECT", "detailed_feedback": "...", ...} + """ + if problem_type == "calculation": + system_prompt = CALCULATION_GRADER_SYSTEM_PROMPT + user_template = CALCULATION_GRADER_USER_TEMPLATE + else: # Default to proof (strict grading) + system_prompt = PROOF_GRADER_SYSTEM_PROMPT + user_template = PROOF_GRADER_USER_TEMPLATE + + messages = [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_template.format( + problem_statement=problem_statement, + solution=solution, + reference_solution=reference_solution + )} + ] + + # Use specified model or default grader model + grader_model = model if model is not None else self.grader_model + + # Use temperature 1.0 for grading (as per original script for o3) + return await self.call_api_with_retry(grader_model, messages, temperature=1.0) + + async def test_single_problem(self, + data: Dict, + variant_type: str = "original", + solver_model: Optional[str] = None, + grader_model: Optional[str] = None) -> Dict: + """ + Test complete workflow for single problem: solving + grading. 
+ + Args: + data: Problem data dictionary + variant_type: Problem variant type ("original" or key names in variants) + solver_model: Model name for solving (if None, uses default solver_model) + grader_model: Model name for grading (if None, uses default grader_model) + + Returns: + Test result dictionary + """ + index = data.get("index", "unknown") + problem_type = data.get("problem_type", "proof") + + try: + # Get problem and reference solution + if variant_type == "original": + question = self.to_str(data.get("question", "")).strip() + reference_solution = self.to_str(data.get("solution", "")).strip() + else: + variant = data.get("variants", {}).get(variant_type) + if not variant: + return { + "index": index, + "variant_type": variant_type, + "status": "skipped", + "reason": f"no_{variant_type}_variant" + } + question = self.to_str(variant.get("question", "")).strip() + reference_solution = self.to_str(variant.get("solution", "")).strip() + + if not question or not reference_solution: + return { + "index": index, + "variant_type": variant_type, + "status": "skipped", + "reason": "missing_fields" + } + + result = { + "index": index, + "variant_type": variant_type, + "problem_type": problem_type, + "status": "completed", + "solve": {}, + "grade": {} + } + + # 1. 
Solve problem + solve_result, solve_raw = await self.solve_problem(question, model=solver_model) + + # Check if max retries reached + if solve_result and solve_result.get("_max_retries_reached"): + # Mark as completed but with INCORRECT grade due to max retries + result["solve"]["status"] = "max_retries" + result["solve"]["solution"] = "Failed to generate solution after maximum retry attempts" + result["solve"]["final_answer"] = "No answer - max retries reached" + result["grade"]["status"] = "auto_failed" + result["grade"]["grade"] = "INCORRECT" + result["grade"]["detailed_feedback"] = f"Automatically marked as incorrect due to reaching maximum retry limit ({self.retries} attempts)" + result["grade"]["major_issues"] = "API call failed after all retry attempts" + result["grade"]["final_answer_correct"] = False + result["grade"]["reasoning_rigor_score"] = 0 + result["grade"]["overall_assessment"] = "Failed to generate solution" + result["correct"] = False + result["status"] = "completed" # Mark as completed, not failed + return result + + if not solve_result: + result["solve"]["status"] = "failed" + result["status"] = "failed" + return result + + student_solution = self.to_str(solve_result.get("solution", "")).strip() + final_answer = self.to_str(solve_result.get("final_answer", "")).strip() + + result["solve"]["status"] = "success" + result["solve"]["solution"] = student_solution + result["solve"]["final_answer"] = final_answer + + # 2. 
Grade solution + grade_result, grade_raw = await self.grade_solution( + question, student_solution, reference_solution, problem_type, model=grader_model + ) + + # Check if grading max retries reached + if grade_result and grade_result.get("_max_retries_reached"): + # Mark as completed but with INCORRECT grade due to max retries in grading + result["grade"]["status"] = "auto_failed" + result["grade"]["grade"] = "INCORRECT" + result["grade"]["detailed_feedback"] = f"Automatically marked as incorrect due to grading reaching maximum retry limit ({self.retries} attempts)" + result["grade"]["major_issues"] = "Grading API call failed after all retry attempts" + result["grade"]["final_answer_correct"] = False + result["grade"]["reasoning_rigor_score"] = 0 + result["grade"]["overall_assessment"] = "Failed to grade solution" + result["correct"] = False + result["status"] = "completed" # Mark as completed, not partial/failed + elif not grade_result: + result["grade"]["status"] = "failed" + result["status"] = "partial" # solving succeeded but grading failed + else: + result["grade"]["status"] = "success" + result["grade"]["grade"] = grade_result.get("grade", "UNKNOWN") + result["grade"]["detailed_feedback"] = grade_result.get("detailed_feedback", "") + result["grade"]["major_issues"] = grade_result.get("major_issues", "") + result["grade"]["final_answer_correct"] = grade_result.get("final_answer_correct", False) + result["grade"]["reasoning_rigor_score"] = grade_result.get("reasoning_rigor_score", 0) + result["grade"]["overall_assessment"] = grade_result.get("overall_assessment", "") + + # Mark whether correct + result["correct"] = grade_result.get("grade") == "CORRECT" + + return result + + except Exception as e: + return { + "index": index, + "variant_type": variant_type, + "status": "error", + "error": str(e), + "error_type": type(e).__name__ + } diff --git a/putnam-bench-anon/loader/cross_provider.py b/putnam-bench-anon/loader/cross_provider.py new file mode 100644 index 
0000000..afd833c --- /dev/null +++ b/putnam-bench-anon/loader/cross_provider.py @@ -0,0 +1,155 @@ +""" +Cross-provider model loader implementation. +Allows using different providers for solving and grading tasks. +""" + +from typing import Dict, Optional, Tuple, Any +from .base import ModelLoader + + +class CrossProviderLoader(ModelLoader): + """Wrapper that allows using different providers for solving and grading.""" + + def __init__(self, + solver_loader: ModelLoader, + grader_loader: Optional[ModelLoader] = None, + **kwargs): + """ + Initialize cross-provider loader. + + Args: + solver_loader: ModelLoader instance for solving problems + grader_loader: ModelLoader instance for grading (if None, uses solver_loader) + **kwargs: Additional arguments passed to parent class + """ + # If no grader loader specified, use the solver loader for both + self.solver_loader = solver_loader + self.grader_loader = grader_loader or solver_loader + + # Initialize parent with combined model info + super().__init__( + solver_model=solver_loader.solver_model, + grader_model=self.grader_loader.grader_model, + **kwargs + ) + + # Track if we're using cross-provider + self.is_cross_provider = grader_loader is not None and grader_loader != solver_loader + + async def _call_api(self, + model: str, + messages: list[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Route API calls to the appropriate provider based on the model. 
+ + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + # Determine which loader to use based on the model + if model == self.solver_model: + return await self.solver_loader._call_api(model, messages, temperature) + elif model == self.grader_model: + return await self.grader_loader._call_api(model, messages, temperature) + else: + # Try to determine based on which loader has the model + if hasattr(self.solver_loader, 'solver_model') and model == self.solver_loader.solver_model: + return await self.solver_loader._call_api(model, messages, temperature) + elif hasattr(self.grader_loader, 'grader_model') and model == self.grader_loader.grader_model: + return await self.grader_loader._call_api(model, messages, temperature) + else: + raise ValueError(f"Model {model} not found in either solver or grader loader") + + def get_model_info(self) -> Dict[str, Any]: + """Get information about the configured models and providers.""" + solver_info = self.solver_loader.get_model_info() + grader_info = self.grader_loader.get_model_info() + + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "solver_provider": solver_info.get("provider", "unknown"), + "grader_provider": grader_info.get("provider", "unknown"), + "is_cross_provider": self.is_cross_provider, + "solver_info": solver_info, + "grader_info": grader_info + } + + async def health_check(self) -> bool: + """ + Perform health checks on both providers. 
+ + Returns: + True if both providers are healthy, False otherwise + """ + print("🔍 Checking solver provider health...") + solver_health = await self.solver_loader.health_check() + + if self.is_cross_provider: + print("🔍 Checking grader provider health...") + grader_health = await self.grader_loader.health_check() + return solver_health and grader_health + else: + return solver_health + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate costs for both providers. + + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with combined cost estimates + """ + # Get solver costs + solver_costs = await self.solver_loader.estimate_cost( + num_problems, avg_problem_length, avg_solution_length + ) + + if self.is_cross_provider: + # Get grader costs separately + grader_costs = await self.grader_loader.estimate_cost( + num_problems, avg_problem_length, avg_solution_length + ) + + # Combine costs + return { + "solver_cost": solver_costs.get("solve_cost", 0), + "grader_cost": grader_costs.get("grade_cost", 0), + "total_cost": solver_costs.get("solve_cost", 0) + grader_costs.get("grade_cost", 0), + "solver_provider": self.solver_loader.get_model_info().get("provider"), + "grader_provider": self.grader_loader.get_model_info().get("provider"), + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "num_problems": num_problems, + "note": "Cross-provider costs combined" + } + else: + # Single provider costs + return solver_costs + + async def __aenter__(self): + """Async context manager entry.""" + if hasattr(self.solver_loader, '__aenter__'): + await self.solver_loader.__aenter__() + if self.is_cross_provider and hasattr(self.grader_loader, '__aenter__'): + await self.grader_loader.__aenter__() 
+ return self + + async def __aexit__(self, exc_type, exc_val, exc_tb): + """Async context manager exit.""" + if hasattr(self.solver_loader, '__aexit__'): + await self.solver_loader.__aexit__(exc_type, exc_val, exc_tb) + if self.is_cross_provider and hasattr(self.grader_loader, '__aexit__'): + await self.grader_loader.__aexit__(exc_type, exc_val, exc_tb)
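The routing behavior of `CrossProviderLoader` can be sketched in isolation. The snippet below is a minimal, self-contained illustration only — `MiniLoader` and `route` are hypothetical stand-ins for the repository's `ModelLoader` base class and `CrossProviderLoader._call_api`, not code from the project:

```python
import asyncio

class MiniLoader:
    """Stand-in for the ModelLoader base class: it only records which
    provider handled a call (illustrative, no real API traffic)."""
    def __init__(self, solver_model: str, grader_model: str, provider: str):
        self.solver_model = solver_model
        self.grader_model = grader_model
        self.provider = provider

    async def _call_api(self, model, messages, temperature=0.0):
        # A real loader would call the provider's API here; the stub just
        # echoes the routing decision so it can be inspected.
        return f"{self.provider}:{model}", ""

async def route(solver, grader, model, messages):
    # Mirrors the cross-provider dispatch: the solver model goes to the
    # solver loader, the grader model to the grader loader, otherwise error.
    if model == solver.solver_model:
        return await solver._call_api(model, messages)
    if model == grader.grader_model:
        return await grader._call_api(model, messages)
    raise ValueError(f"Model {model} not found in either solver or grader loader")

solver = MiniLoader("gpt-4o-mini", "gpt-4o-mini", "openai")
grader = MiniLoader("gemini-1.5-pro", "gemini-1.5-pro", "gemini")

content, _ = asyncio.run(route(solver, grader, "gpt-4o-mini",
                               [{"role": "user", "content": "2+2?"}]))
print(content)  # openai:gpt-4o-mini

content, _ = asyncio.run(route(solver, grader, "gemini-1.5-pro", []))
print(content)  # gemini:gemini-1.5-pro
```

An unknown model name raises `ValueError`, matching the fallback branch in `_call_api` above; this keeps a misconfigured run from silently sending a grader request to the solver's provider.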
\ No newline at end of file diff --git a/putnam-bench-anon/loader/gemini_client.py b/putnam-bench-anon/loader/gemini_client.py new file mode 100644 index 0000000..3ff0be0 --- /dev/null +++ b/putnam-bench-anon/loader/gemini_client.py @@ -0,0 +1,239 @@ +""" +Gemini model loader implementation. +Handles API calls to Google Gemini models with proper error handling and retry logic. +""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional + +try: + import google.generativeai as genai + from google.generativeai.types import generation_types +except ImportError: + genai = None + generation_types = None + +from .base import ModelLoader +from .prompts import RESPONSE_FORMAT + + +class GeminiModelLoader(ModelLoader): + """Gemini implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "gemini-1.5-flash", + grader_model: str = "gemini-1.5-pro", + api_key: Optional[str] = None, + **kwargs): + """ + Initialize Gemini model loader. + + Args: + solver_model: Gemini model for solving problems (default: gemini-1.5-flash) + grader_model: Gemini model for grading solutions (default: gemini-1.5-pro) + api_key: Google AI API key (if None, uses environment variable GOOGLE_API_KEY) + **kwargs: Additional arguments passed to parent class + """ + if genai is None: + raise ImportError( + "google-generativeai package is required for GeminiModelLoader. " + "Install with: pip install google-generativeai" + ) + + super().__init__(solver_model, grader_model, **kwargs) + + # Configure Google AI + if api_key: + genai.configure(api_key=api_key) + else: + # Will use GOOGLE_API_KEY environment variable + genai.configure() + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to Gemini. 
+ + Args: + model: Gemini model name + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Initialize the model + model_instance = genai.GenerativeModel(model) + + # Convert OpenAI format to Gemini format + system_instruction = None + conversation = [] + + for msg in messages: + if msg["role"] == "system": + system_instruction = msg["content"] + elif msg["role"] == "user": + conversation.append({"role": "user", "parts": [msg["content"]]}) + elif msg["role"] == "assistant": + conversation.append({"role": "model", "parts": [msg["content"]]}) + + # Configure generation parameters + generation_config = genai.types.GenerationConfig( + temperature=temperature, + max_output_tokens=4000, + ) + + # Request JSON format for all Gemini models + # Flash models now support JSON format as per latest API documentation + generation_config.response_mime_type = "application/json" + + # Make the API call + if system_instruction and len(conversation) == 1: + # Single user message with system instruction + prompt = f"{system_instruction}\n\n{conversation[0]['parts'][0]}" + response = await asyncio.to_thread( + model_instance.generate_content, + prompt, + generation_config=generation_config + ) + else: + # Multi-turn conversation + if system_instruction: + # Prepend system instruction to first user message + if conversation and conversation[0]["role"] == "user": + conversation[0]["parts"][0] = f"{system_instruction}\n\n{conversation[0]['parts'][0]}" + + response = await asyncio.to_thread( + model_instance.generate_content, + conversation, + generation_config=generation_config + ) + + # Extract response content + content = "" + if response.text: + content = response.text + + return content, content + + except Exception as e: + error_str = str(e) + + # Handle different types of errors + if "quota" in error_str.lower() or "rate" in error_str.lower(): + print(f"🚫 Rate/Quota Error: 
{error_str}") + if "quota" in error_str.lower(): + print("⏳ Detected quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) # 15 minutes + else: + sleep_time = 2 + random.random() + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + # Re-raise to trigger retry logic + raise + elif "api" in error_str.lower(): + print(f"❌ Gemini API Error: {error_str}") + raise + else: + print(f"❌ Unexpected error in Gemini API call: {error_str}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "gemini" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify API connectivity. + + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.0 + ) + + if result and "ok" in result.lower(): + print(f"✅ Gemini API health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ Gemini API health check returned unexpected response") + return False + + except Exception as e: + print(f"❌ Gemini API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Gemini pricing (update with actual Google AI pricing) + # These are rough estimates and should be updated with current pricing + pricing = { + "gemini-1.5-flash": {"input": 0.000075, "output": 0.0003}, # per 1K tokens + "gemini-1.5-pro": {"input": 0.00125, "output": 0.005}, # per 1K tokens + "gemini-1.0-pro": {"input": 0.0005, "output": 0.0015}, # per 1K tokens + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "gemini-1.5-pro" # Default fallback + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + self.solver_model, + tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 3 # Assume output is ~33% of input + ) + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "USD" + } diff --git a/putnam-bench-anon/loader/hf_local.py b/putnam-bench-anon/loader/hf_local.py new file mode 100644 index 0000000..9371436 --- /dev/null +++ b/putnam-bench-anon/loader/hf_local.py @@ -0,0 +1,375 @@ +""" +Hugging Face local model loader 
implementation. +Handles direct inference with locally loaded transformers models. +""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional +import json + +try: + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + import transformers +except ImportError: + torch = None + AutoModelForCausalLM = None + AutoTokenizer = None + pipeline = None + transformers = None + +from .base import ModelLoader + + +class HuggingFaceModelLoader(ModelLoader): + """Hugging Face local model implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "microsoft/DialoGPT-medium", + grader_model: str = "microsoft/DialoGPT-large", + device: str = "auto", + max_length: int = 4000, + **kwargs): + """ + Initialize Hugging Face model loader. + + Args: + solver_model: HuggingFace model name for solving problems + grader_model: HuggingFace model name for grading solutions + device: Device to run models on ("auto", "cuda", "cpu") + max_length: Maximum generation length + **kwargs: Additional arguments passed to parent class + """ + if transformers is None or torch is None: + raise ImportError( + "transformers and torch packages are required for HuggingFaceModelLoader. 
" + "Install with: pip install transformers torch" + ) + + super().__init__(solver_model, grader_model, **kwargs) + + # Device setup + if device == "auto": + self.device = "cuda" if torch.cuda.is_available() else "cpu" + else: + self.device = device + + self.max_length = max_length + + # Model and tokenizer caches + self._models = {} + self._tokenizers = {} + self._pipelines = {} + + print(f"🔧 HuggingFace loader initialized on device: {self.device}") + + async def _load_model(self, model_name: str) -> Tuple[AutoModelForCausalLM, AutoTokenizer]: + """Load model and tokenizer, with caching.""" + if model_name not in self._models: + print(f"📥 Loading model: {model_name}") + + try: + # Load in a separate thread to avoid blocking + tokenizer = await asyncio.to_thread( + AutoTokenizer.from_pretrained, + model_name, + trust_remote_code=True + ) + + model = await asyncio.to_thread( + AutoModelForCausalLM.from_pretrained, + model_name, + torch_dtype=torch.float16 if self.device == "cuda" else torch.float32, + device_map="auto" if self.device == "cuda" else None, + trust_remote_code=True + ) + + if self.device == "cpu": + model = model.to(self.device) + + # Set pad token if not present + if tokenizer.pad_token is None: + tokenizer.pad_token = tokenizer.eos_token + + self._models[model_name] = model + self._tokenizers[model_name] = tokenizer + + print(f"✅ Model loaded successfully: {model_name}") + + except Exception as e: + print(f"❌ Failed to load model {model_name}: {str(e)}") + raise + + return self._models[model_name], self._tokenizers[model_name] + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make a local inference call using the HuggingFace model. 
+ + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Load model and tokenizer + hf_model, tokenizer = await self._load_model(model) + + # Convert messages to prompt format + prompt = self._format_messages(messages) + + # Generate response + response = await self._generate_response( + hf_model, tokenizer, prompt, temperature + ) + + return response, response + + except Exception as e: + print(f"❌ HuggingFace inference error: {str(e)}") + raise + + def _format_messages(self, messages: List[Dict[str, str]]) -> str: + """Convert OpenAI message format to a prompt string.""" + prompt_parts = [] + + for msg in messages: + role = msg["role"] + content = msg["content"] + + if role == "system": + prompt_parts.append(f"System: {content}") + elif role == "user": + prompt_parts.append(f"User: {content}") + elif role == "assistant": + prompt_parts.append(f"Assistant: {content}") + + prompt_parts.append("Assistant:") + return "\n\n".join(prompt_parts) + + async def _generate_response(self, + model: AutoModelForCausalLM, + tokenizer: AutoTokenizer, + prompt: str, + temperature: float) -> str: + """Generate response using the loaded model.""" + + # Tokenize input + inputs = await asyncio.to_thread( + tokenizer.encode, + prompt, + return_tensors="pt", + truncation=True, + max_length=2048 # Leave room for generation + ) + + if self.device == "cuda": + inputs = inputs.to(self.device) + + # Generation parameters + gen_kwargs = { + "max_new_tokens": min(self.max_length, 2048), + "temperature": max(temperature, 0.1), # Avoid 0 temperature + "do_sample": temperature > 0.0, + "pad_token_id": tokenizer.eos_token_id, + "eos_token_id": tokenizer.eos_token_id, + "attention_mask": torch.ones_like(inputs) + } + + if temperature > 0.0: + gen_kwargs.update({ + "top_p": 0.9, + "top_k": 50 + }) + + # Generate + with torch.no_grad(): + outputs = await 
asyncio.to_thread( + model.generate, + inputs, + **gen_kwargs + ) + + # Decode response + generated_text = await asyncio.to_thread( + tokenizer.decode, + outputs[0][inputs.shape[1]:], # Only new tokens + skip_special_tokens=True + ) + + return generated_text.strip() + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "huggingface", + "device": self.device, + "loaded_models": list(self._models.keys()) + } + + async def health_check(self) -> bool: + """ + Perform a simple health check by testing model loading and inference. + + Returns: + True if models can be loaded and run, False otherwise + """ + try: + # Simple test + test_messages = [ + {"role": "user", "content": "Hello, please say 'ok' to confirm you're working."} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.1 + ) + + if result and len(result) > 0: + print(f"✅ HuggingFace health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ HuggingFace health check returned empty response") + return False + + except Exception as e: + print(f"❌ HuggingFace health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate computational cost for processing problems locally. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates (computational cost in arbitrary units) + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Model size-based cost estimation (FLOPS approximation) + model_costs = { + # Small models (< 1B parameters) + "gpt2": 0.5, + "distilgpt2": 0.3, + "dialogpt-small": 0.4, + "dialogpt-medium": 0.8, + + # Medium models (1B - 10B parameters) + "dialogpt-large": 1.5, + "gpt2-medium": 1.0, + "gpt2-large": 2.0, + "gpt2-xl": 4.0, + + # Large models (10B+ parameters) + "llama-7b": 8.0, + "llama-13b": 15.0, + "llama-30b": 35.0, + "llama-65b": 70.0, + } + + def get_model_cost(model: str) -> float: + model_lower = model.lower() + for key, cost in model_costs.items(): + if key in model_lower: + return cost + + # Default based on common model sizes + if any(size in model_lower for size in ["small", "mini"]): + return 0.5 + elif any(size in model_lower for size in ["medium", "base"]): + return 1.0 + elif any(size in model_lower for size in ["large", "xl"]): + return 2.0 + else: + return 1.5 # Default for unknown models + + # Calculate computational costs + solver_cost_factor = get_model_cost(self.solver_model) + grader_cost_factor = get_model_cost(self.grader_model) + + # Device multiplier (GPU is faster but uses more power) + device_multiplier = 0.3 if self.device == "cuda" else 1.0 + + solve_cost = tokens_per_solve * num_problems * solver_cost_factor * device_multiplier / 1000 + grade_cost = tokens_per_grade * num_problems * grader_cost_factor * device_multiplier / 1000 + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": 
round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "computational_units", + "device": self.device, + "note": "Local HuggingFace costs are computational (time/energy/memory)" + } + + async def unload_model(self, model_name: str) -> bool: + """ + Unload a specific model to free memory. + + Args: + model_name: Name of the model to unload + + Returns: + True if successfully unloaded, False otherwise + """ + try: + if model_name in self._models: + del self._models[model_name] + del self._tokenizers[model_name] + + # Force garbage collection + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + print(f"🗑️ Unloaded model: {model_name}") + return True + else: + print(f"⚠️ Model not loaded: {model_name}") + return False + + except Exception as e: + print(f"❌ Error unloading model {model_name}: {str(e)}") + return False + + async def unload_all_models(self) -> bool: + """ + Unload all models to free memory. + + Returns: + True if all models successfully unloaded + """ + try: + model_names = list(self._models.keys()) + success = True + + for model_name in model_names: + if not await self.unload_model(model_name): + success = False + + return success + + except Exception as e: + print(f"❌ Error unloading all models: {str(e)}") + return False diff --git a/putnam-bench-anon/loader/openai_client.py b/putnam-bench-anon/loader/openai_client.py new file mode 100644 index 0000000..fcbe247 --- /dev/null +++ b/putnam-bench-anon/loader/openai_client.py @@ -0,0 +1,603 @@ +""" +OpenAI model loader implementation. +Handles API calls to OpenAI models with proper error handling and retry logic. 
+""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional +import os # Added for KimiModelLoader + +from openai import AsyncOpenAI, RateLimitError, APIError, APIConnectionError, BadRequestError + +from .base import ModelLoader +from .prompts import RESPONSE_FORMAT + + +class OpenAIModelLoader(ModelLoader): + """OpenAI implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "gpt-4o-mini", + grader_model: str = "o3", + api_key: Optional[str] = None, + base_url: Optional[str] = None, + **kwargs): + """ + Initialize OpenAI model loader. + + Args: + solver_model: OpenAI model for solving problems (default: gpt-4o-mini) + grader_model: OpenAI model for grading solutions (default: o3) + api_key: OpenAI API key (if None, uses environment variable) + base_url: Custom base URL for OpenAI API + **kwargs: Additional arguments passed to parent class + """ + super().__init__(solver_model, grader_model, **kwargs) + + # Initialize OpenAI client with custom httpx client for high concurrency + client_kwargs = {} + if api_key: + client_kwargs["api_key"] = api_key + if base_url: + client_kwargs["base_url"] = base_url + + # Configure httpx for high concurrency + import httpx + limits = httpx.Limits( + max_connections=1000, # Total connection pool size + max_keepalive_connections=500, # Persistent connections + keepalive_expiry=30.0 # Keep connections alive for 30s + ) + timeout = httpx.Timeout( + timeout=600.0, # Overall timeout (increased from 300) + connect=60.0, # Connection timeout + read=600.0, # Read timeout (increased from 300) + write=60.0 # Write timeout + ) + + http_client = httpx.AsyncClient( + limits=limits, + timeout=timeout + ) + client_kwargs["http_client"] = http_client + + self.client = AsyncOpenAI(**client_kwargs) + self._http_client = http_client # Keep reference to close later + + async def __aenter__(self): + """Async context manager entry.""" + return self + + async def __aexit__(self, exc_type, exc_val, 
exc_tb): + """Async context manager exit - close http client.""" + if hasattr(self, '_http_client'): + await self._http_client.aclose() + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to OpenAI. + + Args: + model: OpenAI model name + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Override temperature for models that require it + # o1, o3, o3-mini, and o4-mini only support temperature 1.0 + if any(model_name in model.lower() for model_name in ['o1', 'o3', 'o3-mini', 'o4-mini']): + actual_temperature = 1.0 + if self.debug and temperature != 1.0: + print(f"⚠️ Overriding temperature from {temperature} to 1.0 for model {model}") + else: + actual_temperature = temperature + + # Prepare API call parameters + api_params = { + "model": model, + "messages": messages, + "temperature": actual_temperature, + # Set max_tokens to avoid truncation + # Most OpenAI models support at least 4096, newer ones support much more + "max_tokens": 32000, # High default that works for GPT-4 and newer models + } + + # Only add response_format for models that support it + # o1 models and some older models don't support JSON format + # Note: o3 and o3-mini DO support response_format (tested and confirmed) + if not (model.startswith("o1") or model in ["gpt-4", "gpt-3.5-turbo"]): + api_params["response_format"] = RESPONSE_FORMAT + + # Remove max_tokens for models that don't support it + # o1 and o3 models don't support max_tokens parameter + if model.startswith("o1") or model.startswith("o3"): + api_params.pop("max_tokens", None) + + # Make the API call + response = await self.client.chat.completions.create(**api_params) + + # Extract response content + content = response.choices[0].message.content or "" + + return content, content + + except RateLimitError as e: + # Handle 
rate limiting with special logic + error_str = str(e) + if self.debug: + print(f"🚫 RateLimitError: {error_str}") + + if "insufficient_quota" in error_str: + if self.debug: + print("⏳ Detected quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) # 15 minutes + else: + # Standard rate limit - shorter sleep + sleep_time = 2 + random.random() + if self.debug: + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except BadRequestError as e: + # Handle policy violations and other 400 errors with special logic + error_str = str(e) + if self.debug: + print(f"🚫 BadRequestError: {error_str}") + + if "usage policy" in error_str or "flagged" in error_str: + if self.debug: + print("⏳ Detected policy violation - sleeping 30 seconds before retry") + await asyncio.sleep(30) # Longer delay for policy violations + else: + # Standard bad request - shorter sleep + sleep_time = 5 + random.random() + if self.debug: + print(f" ⏰ Bad request error, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except (APIError, APIConnectionError) as e: + if self.debug: + print(f"❌ OpenAI API Error: {str(e)}") + raise + + except Exception as e: + if self.debug: + print(f"❌ Unexpected error in OpenAI API call: {str(e)}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "openai" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify API connectivity. 
+ + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + # Set temperature based on model + # o1, o3, o3-mini, and o4-mini require temperature 1.0 + if any(model_name in self.solver_model.lower() for model_name in ['o1', 'o3', 'o3-mini', 'o4-mini']): + temperature = 1.0 + else: + # Use temperature 0.0 for deterministic results with other models + temperature = 0.0 + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=temperature + ) + + if result and "ok" in result.lower(): + if self.debug: + print(f"✅ OpenAI API health check passed for {self.solver_model}") + return True + else: + if self.debug: + print(f"⚠️ OpenAI API health check returned unexpected response") + return False + + except Exception as e: + if self.debug: + print(f"❌ OpenAI API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Simplified pricing (update with actual OpenAI pricing) + # These are rough estimates and should be updated with current pricing + pricing = { + "gpt-4o-mini": {"input": 0.00015, "output": 0.0006}, # per 1K tokens + "o3": {"input": 0.03, "output": 0.12}, # per 1K tokens (estimated) + "gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "gpt-4" # Default fallback + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + self.solver_model, + tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 3 # Assume output is ~33% of input + ) + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "USD" + } + + +class KimiModelLoader(OpenAIModelLoader): + """Kimi/Moonshot implementation using OpenAI-compatible API.""" + + def __init__(self, + solver_model: str = "kimi-k2-0711-preview", + grader_model: str = "kimi-k2-0711-preview", + api_key: Optional[str] = None, + **kwargs): + """ + 
Initialize Kimi model loader.
+
+        Args:
+            solver_model: Kimi model for solving problems (default: kimi-k2-0711-preview)
+            grader_model: Kimi model for grading solutions (default: kimi-k2-0711-preview)
+            api_key: Kimi API key (if None, uses MOONSHOT_API_KEY environment variable)
+            **kwargs: Additional arguments passed to parent class
+        """
+        # Get API key from parameter or environment
+        if api_key is None:
+            api_key = os.getenv('MOONSHOT_API_KEY')
+
+        # Initialize with Kimi-specific settings
+        super().__init__(
+            solver_model=solver_model,
+            grader_model=grader_model,
+            api_key=api_key,
+            base_url="https://api.moonshot.ai/v1",
+            **kwargs
+        )
+
+    async def _call_api(self,
+                        model: str,
+                        messages: List[Dict[str, str]],
+                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
+        """
+        Make an API call to Kimi with proper error handling.
+
+        Args:
+            model: Kimi model name
+            messages: List of messages in chat format
+            temperature: Temperature for generation
+
+        Returns:
+            Tuple of (response_content, raw_response)
+        """
+        import time
+
+        start_time = time.time()
+        if self.debug:
+            print(f"🔄 Starting Kimi API call with model: {model}")
+
+        try:
+            # Prepare API call parameters
+            api_params = {
+                "model": model,
+                "messages": messages,
+                "temperature": temperature,
+                "response_format": RESPONSE_FORMAT,  # Kimi supports JSON format
+            }
+
+            # Set max_tokens based on model
+            if "128k" in model:
+                api_params["max_tokens"] = 32000  # For 128k context models
+            elif "32k" in model:
+                api_params["max_tokens"] = 16000  # For 32k context models
+            elif "8k" in model:
+                api_params["max_tokens"] = 8000  # For 8k context models
+            elif "k2" in model.lower():
+                api_params["max_tokens"] = 24000  # For K2 models
+            else:
+                api_params["max_tokens"] = 16000  # Default high limit
+
+            if self.debug:
+                print(f"📋 API call parameters: model={model}, messages={len(messages)}, temp={temperature}, max_tokens={api_params['max_tokens']}")
+
+            # Make the API call
+            response = await
self.client.chat.completions.create(**api_params)
+
+            elapsed_time = time.time() - start_time
+            if self.debug:
+                print(f"✅ Kimi API call completed in {elapsed_time:.2f}s")
+
+            # Extract response content
+            content = response.choices[0].message.content or ""
+            if self.debug:
+                print(f"📄 Response length: {len(content)} characters")
+
+            # Check if response might be truncated
+            if self.debug and hasattr(response, 'usage'):
+                completion_tokens = response.usage.completion_tokens
+                print(f"📊 Completion tokens used: {completion_tokens}")
+                if completion_tokens >= api_params['max_tokens'] * 0.95:  # 95% of limit
+                    print(f"⚠️ WARNING: Response may be truncated (used {completion_tokens}/{api_params['max_tokens']} tokens)")
+
+            # Check if content ends abruptly (truncation signs)
+            if self.debug and content and not content.strip().endswith(('"}', '}')):
+                print("⚠️ WARNING: Response doesn't end with proper JSON closure - likely truncated")
+
+            # ============= RAW RESPONSE LOGGING (DEBUG ONLY) =============
+            if self.debug:
+                import json
+                from pathlib import Path
+                from datetime import datetime
+
+                # Create raw response log directory
+                log_dir = Path("kimi_raw_responses")
+                log_dir.mkdir(exist_ok=True)
+
+                # Save raw response
+                timestamp = datetime.now().strftime('%Y%m%d_%H%M%S_%f')[:-3]  # Include milliseconds
+                raw_log_file = log_dir / f"kimi_raw_response_{timestamp}.json"
+
+                raw_response_data = {
+                    "timestamp": datetime.now().isoformat(),
+                    "model": model,
+                    "api_params": api_params,
+                    "response_time_seconds": elapsed_time,
+                    "raw_content": content,
+                    "content_length": len(content),
+                    "response_object": {
+                        "choices": [
+                            {
+                                "message": {
+                                    "content": content,
+                                    "role": response.choices[0].message.role
+                                }
+                            }
+                        ]
+                    }
+                }
+
+                try:
+                    with open(raw_log_file, 'w', encoding='utf-8') as f:
+                        json.dump(raw_response_data, f, indent=2, ensure_ascii=False)
+                    print(f"💾 Raw response saved to: {raw_log_file}")
+                except Exception as save_error:
+                    print(f"❌ Failed to save raw response:
{save_error}") + + # Also print raw content to console + print(f"📋 RAW RESPONSE CONTENT:") + print(f"{'='*60}") + print(content[:1000] + ("..." if len(content) > 1000 else "")) + print(f"{'='*60}") + # ============= END RAW RESPONSE LOGGING ============= + + return content, content + + except RateLimitError as e: + elapsed_time = time.time() - start_time + error_str = str(e) + if self.debug: + print(f"🚫 Kimi RateLimitError after {elapsed_time:.2f}s: {error_str}") + + # Try to capture response details + if self.debug and hasattr(e, 'response') and e.response: + print(f" Status: {e.response.status_code}") + print(f" Headers: {dict(e.response.headers)}") + print(f" Response: {e.response.text[:500]}...") + + if "insufficient_quota" in error_str: + if self.debug: + print("⏳ Detected Kimi quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) # 15 minutes + else: + # Standard rate limit - shorter sleep + sleep_time = 2 + random.random() + if self.debug: + print(f" ⏰ Rate limited on Kimi API, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except (APIError, APIConnectionError) as e: + elapsed_time = time.time() - start_time + error_str = str(e) + if self.debug: + print(f"❌ Kimi API Error after {elapsed_time:.2f}s: {error_str}") + + # Try to capture response details + if self.debug and hasattr(e, 'response') and e.response: + print(f" Status: {e.response.status_code}") + print(f" Headers: {dict(e.response.headers)}") + print(f" Response: {e.response.text[:500]}...") + + # Log request details for debugging + if self.debug and hasattr(e, 'request') and e.request: + print(f" Request URL: {e.request.url}") + print(f" Request method: {e.request.method}") + print(f" Request headers: {dict(e.request.headers)}") + + raise + + except Exception as e: + elapsed_time = time.time() - start_time + error_str = str(e) + if self.debug: + print(f"❌ Unexpected error in Kimi API call after {elapsed_time:.2f}s: 
{error_str}") + print(f" Error type: {type(e).__name__}") + + # Try to capture any additional error details + if self.debug and hasattr(e, 'response'): + try: + print(f" Response status: {e.response.status_code}") + print(f" Response headers: {dict(e.response.headers)}") + print(f" Response text: {e.response.text[:500]}...") + except: + print(" Could not extract response details") + + # Log the full exception + if self.debug: + import traceback + print(f" Full traceback: {traceback.format_exc()}") + + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "kimi", + "base_url": "https://api.moonshot.ai/v1" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify Kimi API connectivity. + + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call with Kimi's system prompt + test_messages = [ + {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI. You are proficient in Chinese and English conversations. You provide users with safe, helpful, and accurate answers. You will reject any questions involving terrorism, racism, or explicit content. 
Moonshot AI is a proper noun and should not be translated."}, + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.0 + ) + + if result and "ok" in result.lower(): + if self.debug: + print(f"✅ Kimi API health check passed for {self.solver_model}") + return True + else: + if self.debug: + print(f"⚠️ Kimi API health check returned unexpected response") + return False + + except Exception as e: + if self.debug: + print(f"❌ Kimi API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems with Kimi models. + + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Kimi pricing (in USD per 1K tokens) + # These are example prices - update with actual Kimi pricing + pricing = { + "moonshot-v1-8k": {"input": 0.012, "output": 0.012}, + "moonshot-v1-32k": {"input": 0.024, "output": 0.024}, + "moonshot-v1-128k": {"input": 0.06, "output": 0.06}, + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "moonshot-v1-8k" # Default to 8k pricing + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + 
self.solver_model, + tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 4 # Grading output is shorter + ) + + return { + "solver_cost": solve_cost, + "grader_cost": grade_cost, + "total_cost": solve_cost + grade_cost, + "num_problems": num_problems, + "solver_model": self.solver_model, + "grader_model": self.grader_model + } diff --git a/putnam-bench-anon/loader/openrouter_client.py b/putnam-bench-anon/loader/openrouter_client.py new file mode 100644 index 0000000..13cd7fa --- /dev/null +++ b/putnam-bench-anon/loader/openrouter_client.py @@ -0,0 +1,213 @@ +""" +OpenRouter model loader implementation. +Handles API calls to OpenRouter service using OpenAI-compatible interface. +OpenRouter provides access to multiple model providers through a single API. +""" + +import os +from typing import Dict, Optional, List, Tuple + +from .openai_client import OpenAIModelLoader + + +class OpenRouterModelLoader(OpenAIModelLoader): + """OpenRouter implementation using OpenAI-compatible API.""" + + def __init__(self, + solver_model: str = "openai/gpt-4o", + grader_model: str = "openai/gpt-4o", + api_key: Optional[str] = None, + site_url: Optional[str] = None, + site_name: Optional[str] = None, + **kwargs): + """ + Initialize OpenRouter model loader. 
+ + Args: + solver_model: Model for solving problems (default: openai/gpt-4o) + Format should be "provider/model-name" (e.g., "openai/gpt-4o", "anthropic/claude-3-opus") + grader_model: Model for grading solutions (default: openai/gpt-4o) + Format should be "provider/model-name" + api_key: OpenRouter API key (if None, uses OPENROUTER_API_KEY environment variable) + site_url: Optional site URL for rankings on openrouter.ai + site_name: Optional site name for rankings on openrouter.ai + **kwargs: Additional arguments passed to parent class + """ + # Get API key from parameter or environment + if api_key is None: + api_key = os.getenv('OPENROUTER_API_KEY') + if not api_key: + raise ValueError("OpenRouter API key not provided. Set OPENROUTER_API_KEY environment variable or pass api_key parameter") + + # Store site information for headers + self.site_url = site_url + self.site_name = site_name + + # Initialize with OpenRouter-specific settings + super().__init__( + solver_model=solver_model, + grader_model=grader_model, + api_key=api_key, + base_url="https://openrouter.ai/api/v1", + **kwargs + ) + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to OpenRouter with proper headers. 
+ + Args: + model: Model name in format "provider/model-name" + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Prepare extra headers for OpenRouter + extra_headers = {} + if self.site_url: + extra_headers["HTTP-Referer"] = self.site_url + if self.site_name: + extra_headers["X-Title"] = self.site_name + + # Prepare API call parameters + api_params = { + "model": model, + "messages": messages, + "temperature": temperature, + # Set max_tokens to avoid truncation, especially for models like Gemini + # 32000 is a reasonable default that works for most models + "max_tokens": 32000, + } + + # Add response_format for all models - OpenRouter handles compatibility + from .prompts import RESPONSE_FORMAT + api_params["response_format"] = RESPONSE_FORMAT + + # Make the API call with extra headers + if extra_headers: + response = await self.client.chat.completions.create( + **api_params, + extra_headers=extra_headers + ) + else: + response = await self.client.chat.completions.create(**api_params) + + # Check if response is valid + if not response or not response.choices or len(response.choices) == 0: + raise ValueError("Empty response from OpenRouter API") + + content = response.choices[0].message.content + if not content: + raise ValueError("Empty content in OpenRouter API response") + + return content, content + + except Exception as e: + # Replace "OpenAI" with "OpenRouter" in error messages + error_msg = str(e) + if "OpenAI API Error" in error_msg: + error_msg = error_msg.replace("OpenAI API Error", "OpenRouter API Error") + + # Log with OpenRouter-specific prefix + if "RateLimitError" in type(e).__name__: + print(f"🚫 OpenRouter RateLimitError: {error_msg}") + raise + elif "APIError" in type(e).__name__ or "APIConnectionError" in type(e).__name__: + print(f"❌ OpenRouter API Error: {error_msg}") + raise + else: + print(f"❌ Unexpected error in OpenRouter API call: 
{error_msg}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "openrouter", + "base_url": "https://openrouter.ai/api/v1" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify OpenRouter API connectivity. + + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + self.solver_model, + test_messages, + temperature=0.0 + ) + + return result is not None + + except Exception as e: + print(f"❌ OpenRouter health check failed: {e}") + return False + + @staticmethod + def get_available_models() -> List[str]: + """ + Get a list of commonly available models on OpenRouter. + Note: This is not exhaustive. Check https://openrouter.ai/models for full list. 
+ + Returns: + List of model identifiers in "provider/model-name" format + """ + return [ + # OpenAI models + "openai/gpt-4o", + "openai/gpt-4o-mini", + "openai/gpt-4-turbo", + "openai/gpt-3.5-turbo", + "openai/o1-preview", + "openai/o1-mini", + + # Anthropic models + "anthropic/claude-3-opus", + "anthropic/claude-3-sonnet", + "anthropic/claude-3-haiku", + "anthropic/claude-2.1", + "anthropic/claude-2", + + # Google models + "google/gemini-pro", + "google/gemini-pro-vision", + "google/palm-2-codechat-bison", + "google/palm-2-chat-bison", + + # Meta models + "meta-llama/llama-3-70b-instruct", + "meta-llama/llama-3-8b-instruct", + "meta-llama/codellama-70b-instruct", + + # Mistral models + "mistralai/mistral-large", + "mistralai/mistral-medium", + "mistralai/mistral-small", + "mistralai/mistral-7b-instruct", + "mistralai/mixtral-8x7b-instruct", + + # Other notable models + "cohere/command-r-plus", + "cohere/command-r", + "databricks/dbrx-instruct", + "deepseek/deepseek-coder", + "deepseek/deepseek-chat", + "qwen/qwen-2-72b-instruct", + "qwen/qwen-1.5-110b-chat", + ]
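The loader above assumes OpenRouter's `provider/model-name` identifier convention noted in the constructor docstring. A minimal standalone sketch of validating that format before constructing a loader (the `split_model_id` helper is hypothetical, not part of the repository):

```python
def split_model_id(model_id: str) -> tuple:
    """Split an OpenRouter-style 'provider/model-name' identifier.

    Raises ValueError when the id has no provider prefix, so a malformed
    name fails fast instead of surfacing as an opaque API error later.
    """
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, name
```

Splitting on the first `/` keeps names like `meta-llama/llama-3-70b-instruct` intact.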
\ No newline at end of file diff --git a/putnam-bench-anon/loader/prompts.py b/putnam-bench-anon/loader/prompts.py new file mode 100644 index 0000000..7f1be83 --- /dev/null +++ b/putnam-bench-anon/loader/prompts.py @@ -0,0 +1,106 @@ +""" +Prompt templates for mathematical problem solving and grading. +These prompts have been refined and validated through extensive testing. +""" + +# Solver system prompt - 4o-mini +SOLVER_SYSTEM_PROMPT = """You are an expert mathematician solving competition-level problems. +Provide detailed, step-by-step solutions with clear mathematical reasoning. + +Requirements: +- Show all your work and intermediate steps +- Justify each major step of your reasoning +- Use proper mathematical notation +- Be thorough but concise +- State your final answer clearly + +Solve the problem completely and rigorously.""" + +SOLVER_USER_TEMPLATE = """Please solve this mathematical problem: + +{problem_statement} + +Provide a complete solution with detailed reasoning. Return your response in JSON format: +{{"solution": "your complete step-by-step solution with mathematical reasoning", + "final_answer": "your final answer in a clear, concise form"}}""" + +# Proof strict grading system prompt - o3 +PROOF_GRADER_SYSTEM_PROMPT = """You are an extremely strict mathematical grader evaluating competition-level PROOF problems. 
+ +GRADING STANDARDS (BE VERY STRICT): +- Mathematical rigor: Every step must be mathematically sound and justified +- Logical flow: The reasoning must be clear, complete, and logically connected +- Correctness: All calculations, algebraic manipulations, and conclusions must be correct +- Completeness: The solution must address all parts of the problem fully +- Precision: Mathematical statements must be precise and unambiguous + +FAILING CRITERIA (Mark as INCORRECT if ANY of these apply): +- Any unjustified logical leap or gap in reasoning +- Any computational error, no matter how small +- Missing steps in critical parts of the argument +- Imprecise or ambiguous mathematical statements +- Incorrect final answer, even if approach is partially correct +- Circular reasoning or logical fallacies +- Misuse of mathematical theorems or definitions + +BE EXTREMELY STRICT. Competition mathematics proofs require perfect precision.""" + +# Calculation lenient grading system prompt - o3 +CALCULATION_GRADER_SYSTEM_PROMPT = """You are a mathematical grader evaluating competition-level CALCULATION problems. + +GRADING STANDARDS FOR CALCULATION PROBLEMS: +- Primary focus: Is the final answer correct? +- Secondary focus: Is the overall approach reasonable and mathematically sound? +- Computation: Allow minor computational slips if the method is correct and final answer is right + +GRADING CRITERIA: +- CORRECT: Final answer is correct AND approach is fundamentally sound +- INCORRECT: Final answer is wrong OR approach is fundamentally flawed + +For calculation problems, the final numerical answer is the most important criterion. +Minor intermediate errors are acceptable if they don't affect the final result.""" + +PROOF_GRADER_USER_TEMPLATE = """Grade this PROOF solution with extreme strictness. + +PROBLEM: +{problem_statement} + +STUDENT SOLUTION: +{solution} + +CORRECT REFERENCE SOLUTION: +{reference_solution} + +Evaluate with maximum strictness. Every logical step must be perfect. 
Return JSON with: +{{"grade": "CORRECT" or "INCORRECT", + "detailed_feedback": "specific detailed analysis of what is right/wrong", + "major_issues": "list of significant mathematical errors or gaps", + "final_answer_correct": true or false, + "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed), + "overall_assessment": "comprehensive evaluation summary"}}""" + +CALCULATION_GRADER_USER_TEMPLATE = """Grade this CALCULATION solution with focus on final answer correctness. + +PROBLEM: +{problem_statement} + +STUDENT SOLUTION: +{solution} + +CORRECT REFERENCE SOLUTION: +{reference_solution} + +Focus primarily on whether the final answer is correct. Return JSON with: +{{"grade": "CORRECT" or "INCORRECT", + "detailed_feedback": "specific detailed analysis of what is right/wrong", + "major_issues": "list of significant mathematical errors or gaps", + "final_answer_correct": true or false, + "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed), + "overall_assessment": "comprehensive evaluation summary"}}""" + +# Response format for JSON output +RESPONSE_FORMAT = {"type": "json_object"} + +# Default retry and timeout settings +DEFAULT_RETRIES = 6 # Limited to 6 retries before marking as failed +DEFAULT_TIMEOUT_BASE = 600
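The templates above double literal braces (`{{` / `}}`) so that `str.format` substitutes only the single-brace placeholders and emits the JSON skeleton verbatim. A minimal sketch with a hypothetical cut-down template (not the full `SOLVER_USER_TEMPLATE`):

```python
# Hypothetical, shortened stand-in for SOLVER_USER_TEMPLATE, for illustration.
TEMPLATE = """Please solve this mathematical problem:

{problem_statement}

Return your response in JSON format:
{{"final_answer": "your final answer"}}"""

# format() fills {problem_statement}; doubled braces survive as literal { }.
prompt = TEMPLATE.format(problem_statement="Compute 2 + 2.")
```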
\ No newline at end of file diff --git a/putnam-bench-anon/loader/vllm_direct.py b/putnam-bench-anon/loader/vllm_direct.py new file mode 100644 index 0000000..b35d99b --- /dev/null +++ b/putnam-bench-anon/loader/vllm_direct.py @@ -0,0 +1,313 @@ +""" +VLLM direct Python API model loader implementation. +Uses VLLM's Python API directly without requiring a separate server process. +""" + +import asyncio +import json +import re +from typing import Dict, List, Tuple, Optional, Any +import torch + +try: + from vllm import LLM, SamplingParams + VLLM_AVAILABLE = True +except ImportError: + LLM = None + SamplingParams = None + VLLM_AVAILABLE = False + +from .base import ModelLoader +from .prompts import SOLVER_SYSTEM_PROMPT, PROOF_GRADER_SYSTEM_PROMPT + + +class VLLMDirectModelLoader(ModelLoader): + """VLLM direct Python API implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "gpt2", + grader_model: str = "gpt2", + max_model_len: int = 512, + gpu_memory_utilization: float = 0.4, + device: str = "auto", + **kwargs): + """ + Initialize VLLM direct model loader. + + Args: + solver_model: Model name for solving problems (default: gpt2) + grader_model: Model name for grading solutions (default: gpt2) + max_model_len: Maximum sequence length (default: 512 for testing) + gpu_memory_utilization: GPU memory utilization ratio (default: 0.4) + device: Device to use ('auto', 'cuda', 'cpu') + **kwargs: Additional arguments passed to parent class + """ + if not VLLM_AVAILABLE: + raise ImportError( + "vllm package is required for VLLMDirectModelLoader. 
" + "Install with: pip install vllm" + ) + + super().__init__(solver_model, grader_model, **kwargs) + + self.max_model_len = max_model_len + self.gpu_memory_utilization = gpu_memory_utilization + self.device = device + + # Model instances (lazy loaded) + self._solver_llm = None + self._grader_llm = None + self._loaded_models = [] + + print(f"🔧 VLLM Direct loader initialized") + print(f" Device: {device}") + print(f" Max length: {max_model_len}") + print(f" GPU utilization: {gpu_memory_utilization}") + + def _get_vllm_config(self, model: str) -> Dict[str, Any]: + """Get VLLM configuration for a model.""" + return { + "model": model, + "max_model_len": self.max_model_len, + "gpu_memory_utilization": self.gpu_memory_utilization, + "trust_remote_code": False, + "enforce_eager": True, # Disable graph optimization for faster startup + } + + async def _load_model(self, model: str, purpose: str) -> LLM: + """Load a VLLM model instance.""" + print(f"📥 Loading {purpose} model: {model}") + + try: + config = self._get_vllm_config(model) + llm = LLM(**config) + + self._loaded_models.append(model) + print(f"✅ Model loaded successfully: {model}") + return llm + + except Exception as e: + print(f"❌ Failed to load model {model}: {e}") + raise + + async def _get_solver_model(self) -> LLM: + """Get or load the solver model.""" + if self._solver_llm is None: + self._solver_llm = await self._load_model(self.solver_model, "solver") + return self._solver_llm + + async def _get_grader_model(self) -> LLM: + """Get or load the grader model.""" + if self._grader_llm is None: + # If solver and grader use the same model, reuse the instance + if self.solver_model == self.grader_model and self._solver_llm is not None: + print(f"♻️ Reusing solver model for grading: {self.grader_model}") + self._grader_llm = self._solver_llm + else: + self._grader_llm = await self._load_model(self.grader_model, "grader") + return self._grader_llm + + def _format_messages_as_prompt(self, messages: List[Dict[str, 
str]]) -> str:
+        """Convert chat messages to a single prompt string."""
+        prompt_parts = []
+
+        for message in messages:
+            role = message["role"]
+            content = message["content"]
+
+            if role == "system":
+                prompt_parts.append(f"System: {content}")
+            elif role == "user":
+                prompt_parts.append(f"User: {content}")
+            elif role == "assistant":
+                prompt_parts.append(f"Assistant: {content}")
+
+        # Add final assistant prompt
+        if messages[-1]["role"] != "assistant":
+            prompt_parts.append("Assistant:")
+
+        return "\n\n".join(prompt_parts)
+
+    def _extract_json_from_response(self, response: str) -> Optional[Dict]:
+        """Extract JSON from model response."""
+        try:
+            # Try to find JSON in the response
+            json_match = re.search(r'\{.*\}', response, re.DOTALL)
+            if json_match:
+                json_str = json_match.group()
+                return json.loads(json_str)
+
+            # If no JSON found, try to parse the entire response
+            return json.loads(response.strip())
+
+        except json.JSONDecodeError:
+            # If JSON parsing fails, return None
+            return None
+
+    async def _call_api(self,
+                        model: str,
+                        messages: List[Dict[str, str]],
+                        temperature: float = 0.0) -> Tuple[Optional[str], str]:
+        """
+        Make an inference call using VLLM.
+ + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Get the appropriate model instance + if model == self.solver_model: + llm = await self._get_solver_model() + elif model == self.grader_model: + llm = await self._get_grader_model() + else: + raise ValueError(f"Unknown model: {model}") + + # Convert messages to prompt + prompt = self._format_messages_as_prompt(messages) + + # Set up sampling parameters + sampling_params = SamplingParams( + temperature=temperature, + top_p=0.95, + max_tokens=500, # Reasonable limit for responses + stop=["\nUser:", "\nSystem:"] # Stop at new conversation turns + ) + + # Generate response + outputs = llm.generate([prompt], sampling_params) + + if outputs and len(outputs) > 0: + generated_text = outputs[0].outputs[0].text + return generated_text.strip(), generated_text + else: + return None, "" + + except Exception as e: + print(f"❌ VLLM inference error: {str(e)}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "vllm_direct", + "device": self.device, + "loaded_models": self._loaded_models + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify VLLM functionality. + + Returns: + True if models can be loaded and generate text, False otherwise + """ + try: + print(f"🔍 VLLM health check starting...") + + # Try to load and use the solver model + test_messages = [ + {"role": "user", "content": "Hello! 
Please respond with 'Health check OK'."} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.1 + ) + + if result and len(result) > 0: + print(f"✅ VLLM health check passed for {self.solver_model}") + print(f" Response: {result[:50]}...") + return True + else: + print(f"❌ VLLM health check failed: empty response") + return False + + except Exception as e: + print(f"❌ VLLM health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. + For direct VLLM, cost is computational (time/energy). + + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Token estimates (1 token ≈ 4 characters) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Model size cost factors (based on parameter count) + model_costs = { + "gpt2": 1.0, # 124M params + "distilgpt2": 0.5, # 82M params + "microsoft/dialo": 1.2, # DialoGPT variants + "tinyllama": 2.0, # 1.1B params + } + + def get_model_cost(model: str) -> float: + model_lower = model.lower() + for key, cost in model_costs.items(): + if key in model_lower: + return cost + return 1.5 # Default cost + + solver_cost_factor = get_model_cost(self.solver_model) + grader_cost_factor = get_model_cost(self.grader_model) + + # Computational cost estimation (arbitrary units) + solve_cost = tokens_per_solve * num_problems * solver_cost_factor / 10000 + grade_cost = tokens_per_grade * num_problems * grader_cost_factor / 10000 + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), 
+ "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "computational_units", + "note": "Direct VLLM costs are computational (GPU time/energy)" + } + + async def unload_all_models(self): + """Unload all loaded models to free GPU memory.""" + try: + print("🗑️ Unloading VLLM models...") + + # Clean up model instances + if self._solver_llm is not None: + del self._solver_llm + self._solver_llm = None + + if self._grader_llm is not None and self._grader_llm != self._solver_llm: + del self._grader_llm + self._grader_llm = None + + # Clear CUDA cache + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + self._loaded_models.clear() + print("✅ Models unloaded successfully") + + except Exception as e: + print(f"⚠️ Error during model cleanup: {e}")
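The `_extract_json_from_response` method above relies on a greedy DOTALL regex spanning the first to the last brace, falling back to parsing the whole string. A standalone sketch of the same strategy:

```python
import json
import re
from typing import Optional


def extract_json(response: str) -> Optional[dict]:
    """Best-effort JSON extraction, mirroring the loader's approach."""
    try:
        # Greedy match grabs everything between the first '{' and last '}',
        # so a single JSON object with surrounding prose is captured whole.
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match:
            return json.loads(match.group())
        # No braces at all: try the raw (stripped) response.
        return json.loads(response.strip())
    except json.JSONDecodeError:
        return None
```

Note the trade-off of the greedy match: nested objects parse fine, but two separate top-level objects in one response will span into one invalid blob and return `None`.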
\ No newline at end of file diff --git a/putnam-bench-anon/loader/vllm_local.py b/putnam-bench-anon/loader/vllm_local.py new file mode 100644 index 0000000..bc8c4fb --- /dev/null +++ b/putnam-bench-anon/loader/vllm_local.py @@ -0,0 +1,224 @@ +""" +VLLM local model loader implementation. +Handles API calls to locally deployed VLLM services with OpenAI-compatible endpoints. +""" + +import asyncio +import random +from typing import Dict, List, Tuple, Optional + +try: + from openai import AsyncOpenAI, RateLimitError, APIError, APIConnectionError +except ImportError: + AsyncOpenAI = None + RateLimitError = Exception + APIError = Exception + APIConnectionError = Exception + +from .base import ModelLoader +from .prompts import RESPONSE_FORMAT + + +class VLLMModelLoader(ModelLoader): + """VLLM local model implementation of the ModelLoader.""" + + def __init__(self, + solver_model: str = "meta-llama/Llama-3.2-3B-Instruct", + grader_model: str = "meta-llama/Llama-3.2-8B-Instruct", + base_url: str = "http://localhost:8000/v1", + api_key: str = "EMPTY", + **kwargs): + """ + Initialize VLLM model loader. + + Args: + solver_model: Model name for solving problems (default: Llama-3.2-3B-Instruct) + grader_model: Model name for grading solutions (default: Llama-3.2-8B-Instruct) + base_url: VLLM server URL (default: http://localhost:8000/v1) + api_key: API key for VLLM server (default: "EMPTY" for local) + **kwargs: Additional arguments passed to parent class + """ + if AsyncOpenAI is None: + raise ImportError( + "openai package is required for VLLMModelLoader. " + "Install with: pip install openai" + ) + + super().__init__(solver_model, grader_model, **kwargs) + + # Initialize OpenAI-compatible client for VLLM + self.client = AsyncOpenAI( + base_url=base_url, + api_key=api_key + ) + self.base_url = base_url + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to VLLM server. 
+ + Args: + model: Model name to use + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Prepare API call parameters + api_params = { + "model": model, + "messages": messages, + "temperature": temperature, + "max_tokens": 4000, + } + + # Only add response_format for models that support it + # Most local models may not support structured JSON output + if temperature == 0.0: + try: + api_params["response_format"] = RESPONSE_FORMAT + except: + # If JSON format is not supported, we'll parse manually + pass + + # Make the API call + response = await self.client.chat.completions.create(**api_params) + + # Extract response content + content = response.choices[0].message.content or "" + + return content, content + + except (RateLimitError, APIError, APIConnectionError) as e: + # Handle various API errors + error_str = str(e) + print(f"❌ VLLM API Error: {error_str}") + + if "rate" in error_str.lower() or "limit" in error_str.lower(): + sleep_time = 2 + random.random() + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + # Re-raise to trigger retry logic + raise + + except Exception as e: + print(f"❌ Unexpected error in VLLM API call: {str(e)}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "vllm", + "base_url": self.base_url + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify VLLM server connectivity. 
+ + Returns: + True if server is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.0 + ) + + if result and ("ok" in result.lower() or "hello" in result.lower()): + print(f"✅ VLLM API health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ VLLM API health check returned unexpected response") + return False + + except Exception as e: + print(f"❌ VLLM API health check failed: {str(e)}") + print(f" Make sure VLLM server is running at {self.base_url}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems. + For local VLLM, cost is typically computational (time/energy) rather than monetary. 
+ + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates (computational cost in arbitrary units) + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # Computational cost estimation (arbitrary units based on model size) + # Larger models consume more computational resources + model_costs = { + "llama-3.2-1b": 1.0, + "llama-3.2-3b": 2.0, + "llama-3.2-8b": 4.0, + "llama-3.1-8b": 4.0, + "llama-3.1-70b": 20.0, + "mistral-7b": 3.0, + "qwen2.5-7b": 3.0, + } + + def get_model_cost(model: str) -> float: + model_lower = model.lower() + for key, cost in model_costs.items(): + if key in model_lower: + return cost + return 3.0 # Default cost for unknown models + + # Calculate computational costs + solver_cost_factor = get_model_cost(self.solver_model) + grader_cost_factor = get_model_cost(self.grader_model) + + solve_cost = tokens_per_solve * num_problems * solver_cost_factor / 1000 + grade_cost = tokens_per_grade * num_problems * grader_cost_factor / 1000 + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "computational_units", + "note": "Local VLLM costs are computational (time/energy) rather than monetary" + } + + async def list_models(self) -> List[str]: + """ + List available models on the VLLM server. 
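The cost heuristic above (1 token ≈ 4 characters, plus a per-model size factor) can be captured as one standalone function. This is a sketch of the same arithmetic with the factors passed in explicitly; both the character-to-token ratio and the factors are rough assumptions carried over from the code, not measured values:

```python
def estimate_local_cost(num_problems: int,
                        solver_factor: float,
                        grader_factor: float,
                        avg_problem_chars: int = 1000,
                        avg_solution_chars: int = 2000) -> float:
    """Illustrative mirror of the estimate above, in arbitrary units.

    chars // 4 approximates tokens; grading reads the problem plus
    roughly twice the solution text, hence the heavier grade term.
    """
    tokens_per_solve = (avg_problem_chars + avg_solution_chars) // 4
    tokens_per_grade = (avg_problem_chars + avg_solution_chars * 2) // 4
    solve_cost = tokens_per_solve * num_problems * solver_factor / 1000
    grade_cost = tokens_per_grade * num_problems * grader_factor / 1000
    return round(solve_cost + grade_cost, 4)

# 10 problems with a 3B-class solver (factor 2.0) and 8B-class grader (4.0)
total = estimate_local_cost(10, 2.0, 4.0)
```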
+ + Returns: + List of available model names + """ + try: + # Try to get models list from VLLM server + models_response = await self.client.models.list() + return [model.id for model in models_response.data] + except Exception as e: + print(f"⚠️ Could not retrieve models list: {str(e)}") + return [self.solver_model, self.grader_model] diff --git a/putnam-bench-anon/loader/xai_client.py b/putnam-bench-anon/loader/xai_client.py new file mode 100644 index 0000000..10c4cf4 --- /dev/null +++ b/putnam-bench-anon/loader/xai_client.py @@ -0,0 +1,173 @@ +""" +xAI model loader implementation. +Handles API calls to xAI Grok models using OpenAI-compatible interface. +""" + +import os +from typing import Dict, Optional, List, Tuple + +from .openai_client import OpenAIModelLoader + + +class XAIModelLoader(OpenAIModelLoader): + """xAI implementation using OpenAI-compatible API.""" + + def __init__(self, + solver_model: str = "grok-3", + grader_model: str = "grok-3", + api_key: Optional[str] = None, + **kwargs): + """ + Initialize xAI model loader. + + Args: + solver_model: xAI model for solving problems (default: grok-3) + grader_model: xAI model for grading solutions (default: grok-3) + api_key: xAI API key (if None, uses XAI_API_KEY environment variable) + **kwargs: Additional arguments passed to parent class + """ + # Get API key from parameter or environment + if api_key is None: + api_key = os.getenv('XAI_API_KEY') + + # Initialize with xAI-specific settings + super().__init__( + solver_model=solver_model, + grader_model=grader_model, + api_key=api_key, + base_url="https://api.x.ai/v1", + **kwargs + ) + + async def _call_api(self, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0) -> Tuple[Optional[str], str]: + """ + Make an API call to xAI with proper error handling. 
+ + Args: + model: xAI model name + messages: List of messages in chat format + temperature: Temperature for generation + + Returns: + Tuple of (response_content, raw_response) + """ + try: + # Call parent's implementation + return await super()._call_api(model, messages, temperature) + + except Exception as e: + # Replace "OpenAI" with "xAI" in error messages + error_msg = str(e) + if "OpenAI API Error" in error_msg: + error_msg = error_msg.replace("OpenAI API Error", "xAI API Error") + + # Log with xAI-specific prefix + if "RateLimitError" in type(e).__name__: + print(f"🚫 xAI RateLimitError: {error_msg}") + raise + elif "APIError" in type(e).__name__ or "APIConnectionError" in type(e).__name__: + print(f"❌ xAI API Error: {error_msg}") + raise + else: + print(f"❌ Unexpected error in xAI API call: {error_msg}") + raise + + def get_model_info(self) -> Dict[str, str]: + """Get information about the configured models.""" + return { + "solver_model": self.solver_model, + "grader_model": self.grader_model, + "provider": "xai", + "base_url": "https://api.x.ai/v1" + } + + async def health_check(self) -> bool: + """ + Perform a simple health check to verify xAI API connectivity. 
+ + Returns: + True if API is accessible, False otherwise + """ + try: + # Simple test call + test_messages = [ + {"role": "user", "content": "Hello, please respond with a simple JSON: {\"status\": \"ok\"}"} + ] + + result, _ = await self._call_api( + model=self.solver_model, + messages=test_messages, + temperature=0.0 + ) + + if result and "ok" in result.lower(): + print(f"✅ xAI API health check passed for {self.solver_model}") + return True + else: + print(f"⚠️ xAI API health check returned unexpected response") + return False + + except Exception as e: + print(f"❌ xAI API health check failed: {str(e)}") + return False + + async def estimate_cost(self, + num_problems: int, + avg_problem_length: int = 1000, + avg_solution_length: int = 2000) -> Dict[str, float]: + """ + Estimate the cost for processing a given number of problems with xAI models. + + Args: + num_problems: Number of problems to process + avg_problem_length: Average length of problem statements in characters + avg_solution_length: Average length of solutions in characters + + Returns: + Dictionary with cost estimates + """ + # Rough token estimates (1 token ≈ 4 characters for English) + tokens_per_solve = (avg_problem_length + avg_solution_length) // 4 + tokens_per_grade = (avg_problem_length + avg_solution_length * 2) // 4 + + # xAI pricing (update with actual pricing when available) + # These are estimates based on similar model pricing + pricing = { + "grok-3": {"input": 0.01, "output": 0.03}, # per 1K tokens (estimated) + "grok-2": {"input": 0.005, "output": 0.015}, # per 1K tokens (estimated) + } + + def get_model_cost(model: str, input_tokens: int, output_tokens: int) -> float: + if model not in pricing: + model = "grok-3" # Default to grok-3 pricing + + input_cost = (input_tokens / 1000) * pricing[model]["input"] + output_cost = (output_tokens / 1000) * pricing[model]["output"] + return input_cost + output_cost + + # Calculate costs + solve_cost = get_model_cost( + self.solver_model, + 
tokens_per_solve * num_problems, + tokens_per_solve * num_problems // 2 # Assume output is ~50% of input + ) + + grade_cost = get_model_cost( + self.grader_model, + tokens_per_grade * num_problems, + tokens_per_grade * num_problems // 3 # Assume output is ~33% of input + ) + + total_cost = solve_cost + grade_cost + + return { + "solve_cost": round(solve_cost, 4), + "grade_cost": round(grade_cost, 4), + "total_cost": round(total_cost, 4), + "cost_per_problem": round(total_cost / num_problems, 6), + "currency": "USD", + "note": "xAI pricing estimates - update with actual pricing" + }
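The per-1K-token pricing arithmetic in the xAI estimate reduces to a single small function. A sketch, using the same placeholder grok-3 rates as the code above; these rates are estimates flagged as such in the source, not confirmed xAI prices:

```python
def token_cost(input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> float:
    """Cost in USD given separate per-1K-token input and output rates."""
    return ((input_tokens / 1000) * input_rate
            + (output_tokens / 1000) * output_rate)

# Placeholder grok-3 rates from above: $0.01 input / $0.03 output per 1K tokens,
# with output assumed to be ~50% of input volume, as in the solve estimate.
cost = token_cost(1000, 500, 0.01, 0.03)
```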
\ No newline at end of file diff --git a/putnam-bench-anon/putnam_cli.py b/putnam-bench-anon/putnam_cli.py new file mode 100644 index 0000000..59ca5d3 --- /dev/null +++ b/putnam-bench-anon/putnam_cli.py @@ -0,0 +1,813 @@ +#!/usr/bin/env python3 +""" +Putnam CLI - Simple command-line interface for mathematical problem solving. + +This CLI provides easy-to-use commands for testing problems, checking health, +running benchmarks, and managing the system. + +Usage: + putnam solve problem.json # Solve a single problem + putnam test --provider openai # Quick test + putnam health # Check all providers + putnam benchmark --quick # Quick benchmark + putnam batch dataset/ --provider anthropic # Batch evaluation + +Cross-provider usage: + putnam solve problem.json --solver-provider kimi --grader-provider openai + putnam batch dataset/ --solver-provider kimi --grader-provider openai +""" + +import asyncio +import json +import sys +from pathlib import Path +import argparse +from typing import Dict, Any, Optional +import os + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import create_loader, create_cross_provider_loader, get_supported_providers, get_default_models + + +class PutnamCLI: + """Main CLI class for Putnam problem solver.""" + + def __init__(self): + self.verbose = False + + def print_banner(self): + """Print CLI banner.""" + print("🧮 Putnam Mathematical Problem Solver CLI") + print("=" * 50) + + def print_providers(self): + """Print available providers.""" + print("\n🤖 Available Providers:") + for provider in get_supported_providers(): + defaults = get_default_models(provider) + print(f" • {provider.upper()}") + print(f" Solver: {defaults['solver_model']}") + print(f" Grader: {defaults['grader_model']}") + print() + + def _create_loader(self, args, loader_kwargs: Optional[Dict] = None) -> Any: + """ + Create a loader based on command-line arguments. + Handles both single-provider and cross-provider scenarios. 
+ + Args: + args: Command-line arguments + loader_kwargs: Additional kwargs for loader creation + + Returns: + ModelLoader instance + """ + loader_kwargs = loader_kwargs or {} + + # Add debug flag if available + if hasattr(args, 'debug') and args.debug: + loader_kwargs['debug'] = True + + # Handle provider-specific settings + if hasattr(args, 'vllm_url') and args.vllm_url: + if args.provider == 'vllm' or (hasattr(args, 'solver_provider') and args.solver_provider == 'vllm'): + loader_kwargs['solver_kwargs'] = loader_kwargs.get('solver_kwargs', {}) + loader_kwargs['solver_kwargs']['base_url'] = args.vllm_url + if hasattr(args, 'grader_provider') and args.grader_provider == 'vllm': + loader_kwargs['grader_kwargs'] = loader_kwargs.get('grader_kwargs', {}) + loader_kwargs['grader_kwargs']['base_url'] = args.vllm_url + + if hasattr(args, 'device') and args.device: + if args.provider == 'huggingface' or (hasattr(args, 'solver_provider') and args.solver_provider == 'huggingface'): + loader_kwargs['solver_kwargs'] = loader_kwargs.get('solver_kwargs', {}) + loader_kwargs['solver_kwargs']['device'] = args.device + if hasattr(args, 'grader_provider') and args.grader_provider == 'huggingface': + loader_kwargs['grader_kwargs'] = loader_kwargs.get('grader_kwargs', {}) + loader_kwargs['grader_kwargs']['device'] = args.device + + # Check if we're using cross-provider mode + if hasattr(args, 'solver_provider') and args.solver_provider: + # Cross-provider mode + print(f"🚀 Using solver provider: {args.solver_provider}") + if hasattr(args, 'grader_provider') and args.grader_provider: + print(f"🎯 Using grader provider: {args.grader_provider}") + else: + print(f"🎯 Using grader provider: {args.solver_provider} (same as solver)") + + return create_cross_provider_loader( + solver_provider=args.solver_provider, + grader_provider=args.grader_provider if hasattr(args, 'grader_provider') else None, + solver_model=args.solver_model if hasattr(args, 'solver_model') else None, + 
grader_model=args.grader_model if hasattr(args, 'grader_model') else None, + **loader_kwargs + ) + else: + # Single provider mode (backward compatibility) + provider = args.provider if hasattr(args, 'provider') else "openai" + print(f"🚀 Using provider: {provider}") + + # Handle special cases for single provider + if provider == 'vllm' and hasattr(args, 'vllm_url'): + loader_kwargs['base_url'] = args.vllm_url + elif provider == 'huggingface' and hasattr(args, 'device'): + loader_kwargs['device'] = args.device + + return create_loader( + provider, + solver_model=args.solver_model if hasattr(args, 'solver_model') else None, + grader_model=args.grader_model if hasattr(args, 'grader_model') else None, + **loader_kwargs + ) + + async def cmd_solve(self, args) -> int: + """Solve a single problem.""" + self.print_banner() + + # Setup logging + import logging + from datetime import datetime + from pathlib import Path + + # Create log file + log_dir = Path("solve_logs") + log_dir.mkdir(exist_ok=True) + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + log_file = log_dir / f"solve_debug_{timestamp}.log" + + # Setup logging + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler(log_file), + logging.StreamHandler() + ] + ) + logger = logging.getLogger(__name__) + + logger.info(f"🔍 Starting solve command, log file: {log_file}") + + # Load problem + try: + with open(args.problem_file, 'r', encoding='utf-8') as f: + problem_data = json.load(f) + logger.info(f"📁 Problem loaded from {args.problem_file}") + except Exception as e: + logger.error(f"❌ Error loading problem: {str(e)}") + return 1 + + # Setup provider + loader = self._create_loader(args) + logger.info(f"🤖 Created loader: solver={loader.solver_model}, grader={loader.grader_model}") + + # Health check + print("🔍 Checking provider health...") + if not await loader.health_check(): + logger.error("❌ Provider health check failed") + return 1 + + 
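The branching above that chooses between cross-provider and single-provider mode can be sketched in isolation. `pick_mode` is a hypothetical distillation of the logic, not a function in the codebase; the `"openai"` fallback mirrors the default in `_create_loader`:

```python
from types import SimpleNamespace

def pick_mode(args):
    """Return (mode, solver_provider, grader_provider) from CLI-style args.

    An explicit --solver-provider switches to cross-provider mode; a
    missing --grader-provider then falls back to the solver's provider.
    Otherwise single-provider mode uses --provider (default "openai").
    """
    if getattr(args, "solver_provider", None):
        grader = getattr(args, "grader_provider", None) or args.solver_provider
        return ("cross", args.solver_provider, grader)
    return ("single", getattr(args, "provider", None) or "openai", None)

# Cross-provider with the grader defaulting to the solver's provider
mode = pick_mode(SimpleNamespace(solver_provider="kimi", grader_provider=None))
```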
# Show problem + variant_type = args.variant or "original" + problem_stmt = problem_data.get(variant_type, {}).get('problem_statement', 'N/A') + logger.info(f"📝 Problem variant: {variant_type}") + logger.info(f"📄 Problem statement: {problem_stmt[:500]}...") + + print(f"\n📝 Problem ({variant_type}):") + print(f" {problem_stmt[:200]}{'...' if len(problem_stmt) > 200 else ''}") + + # Solve + print(f"\n⚡ Solving with {loader.solver_model}...") + logger.info(f"🔄 Starting solve process...") + + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type, + solver_model=args.solver_model, + grader_model=args.grader_model + ) + + # Log detailed results + logger.info("📊 DETAILED RESULTS:") + logger.info(f" Full result: {json.dumps(result, indent=2, ensure_ascii=False)}") + + # Analyze solve step + solve_data = result.get('solve', {}) + solve_status = solve_data.get('status', 'unknown') + logger.info(f"🔍 SOLVE ANALYSIS:") + logger.info(f" Status: {solve_status}") + + if solve_status == 'success': + solution = solve_data.get('solution', 'N/A') + logger.info(f" Solution length: {len(solution)} characters") + logger.info(f" Solution preview: {solution[:200]}...") + else: + error_msg = solve_data.get('error', 'No error message') + logger.error(f" Solve error: {error_msg}") + + # Analyze grade step + grade_data = result.get('grade', {}) + grade_status = grade_data.get('status', 'unknown') + logger.info(f"🔍 GRADE ANALYSIS:") + logger.info(f" Status: {grade_status}") + + if grade_status == 'success': + grade = grade_data.get('grade', 'N/A') + feedback = grade_data.get('detailed_feedback', 'N/A') + logger.info(f" Grade: {grade}") + logger.info(f" Feedback: {feedback}") + else: + error_msg = grade_data.get('error', 'No error message') + logger.error(f" Grade error: {error_msg}") + + # Show results + print(f"\n✅ Solution completed!") + + # Extract and display grade + if result.get('grade', {}).get('status') == 'success': + grade = result.get('grade', 
{}).get('grade', 'N/A') + is_correct = result.get('correct', False) + grade_display = f"{grade} ({'✓' if is_correct else '✗'})" + else: + grade_display = 'N/A (grading failed)' + + # Extract and display solution + if result.get('solve', {}).get('status') == 'success': + solution = result.get('solve', {}).get('solution', 'N/A') + else: + solution = 'N/A (solving failed)' + + print(f"🎯 Final Grade: {grade_display}") + print(f"🤖 Solution:") + print(f" {solution[:300]}{'...' if len(solution) > 300 else ''}") + + if args.verbose: + grading = result.get('grade', {}) + print(f"\n📊 Grading Details:") + print(f" Feedback: {grading.get('detailed_feedback', 'N/A')[:200]}...") + print(f" Major Issues: {grading.get('major_issues', 'N/A')}") + print(f" Rigor Score: {grading.get('reasoning_rigor_score', 'N/A')}") + + # Save detailed results + results_file = log_dir / f"solve_results_{timestamp}.json" + with open(results_file, 'w', encoding='utf-8') as f: + json.dump(result, f, indent=2, ensure_ascii=False) + logger.info(f"💾 Detailed results saved to {results_file}") + + # Save if requested + if args.output: + with open(args.output, 'w', encoding='utf-8') as f: + json.dump(result, f, indent=2, ensure_ascii=False) + print(f"💾 Results saved to {args.output}") + + print(f"\n📋 Log file created: {log_file}") + print(f"📋 Results file created: {results_file}") + + return 0 + + async def cmd_test(self, args) -> int: + """Quick test of a provider.""" + self.print_banner() + + # Create simple test problem + test_problem = { + 'question': 'Calculate 15 + 27.', + 'solution': 'The answer is 42.', + 'problem_type': 'calculation' + } + + try: + loader = self._create_loader(args) + + print("🔍 Health check...") + if not await loader.health_check(): + print("❌ Health check failed") + return 1 + + print("⚡ Running test problem...") + result = await loader.test_single_problem(test_problem, variant_type='original') + + print(f"✅ Test completed!") + + # Extract grade information + if 
result.get('grade', {}).get('status') == 'success': + grade = result.get('grade', {}).get('grade', 'N/A') + is_correct = result.get('correct', False) + grade_display = f"{grade} ({'✓' if is_correct else '✗'})" + else: + grade_display = 'N/A (grading failed)' + + # Extract solution + if result.get('solve', {}).get('status') == 'success': + solution = result.get('solve', {}).get('solution', 'N/A') + else: + solution = 'N/A (solving failed)' + + print(f"🎯 Grade: {grade_display}") + print(f"🤖 Solution: {solution[:100]}...") + + return 0 + + except Exception as e: + print(f"❌ Test failed: {str(e)}") + return 1 + + async def cmd_health(self, args) -> int: + """Check health of providers.""" + self.print_banner() + + print("🏥 Checking provider health...") + + # Import health check + try: + from scripts.health_check import HealthChecker + checker = HealthChecker(detailed=args.detailed) + + results = await checker.check_all_providers(args.provider) + + # Simple summary + summary = results['summary'] + print(f"\n📋 Summary: {summary['healthy_providers']}/{summary['total_providers']} providers healthy") + + return 0 if summary['healthy_providers'] > 0 else 1 + + except ImportError: + print("❌ Health check module not available") + return 1 + except Exception as e: + print(f"❌ Health check failed: {str(e)}") + return 1 + + async def cmd_benchmark(self, args) -> int: + """Run benchmark.""" + self.print_banner() + + print("🏁 Running benchmark...") + + try: + from scripts.benchmark import run_quick_test + await run_quick_test() + return 0 + + except ImportError: + print("❌ Benchmark module not available") + return 1 + except Exception as e: + print(f"❌ Benchmark failed: {str(e)}") + return 1 + + async def cmd_batch(self, args) -> int: + """Run batch evaluation.""" + self.print_banner() + + # Handle resume case - simplified version + if args.resume: + if not args.resume.exists(): + print(f"❌ Resume checkpoint file not found: {args.resume}") + return 1 + + # Simple resume: just read 
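The grade/solution extraction used by both the solve and test commands follows one defensive pattern: every lookup tolerates a missing or failed stage. A self-contained sketch (`summarize_result` is a hypothetical helper, not a function in the codebase, and it omits the ✓/✗ correctness suffix for brevity):

```python
def summarize_result(result: dict) -> tuple:
    """Extract display strings from a solve/grade result dict.

    Each stage dict may be absent entirely or present with a non-success
    status; both cases fall back to an explanatory placeholder.
    """
    grade = result.get("grade", {})
    solve = result.get("solve", {})
    grade_display = (f"{grade.get('grade', 'N/A')}"
                     if grade.get("status") == "success"
                     else "N/A (grading failed)")
    solution = (solve.get("solution", "N/A")
                if solve.get("status") == "success"
                else "N/A (solving failed)")
    return grade_display, solution

ok = summarize_result({"grade": {"status": "success", "grade": 8},
                       "solve": {"status": "success", "solution": "x = 42"}})
```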
completed problems list + print(f"📂 Resuming from checkpoint: {args.resume}") + with open(args.resume) as f: + checkpoint_data = json.load(f) + + # Extract completed problem indices + completed_indices = checkpoint_data.get('completed_indices', []) + print(f" Found {len(completed_indices)} completed problems to skip") + + # Still need dataset path for resume + if not args.dataset_path: + # Try to get from checkpoint for convenience + dataset_path = checkpoint_data.get('dataset_path') + if dataset_path: + dataset_path = Path(dataset_path) + print(f" Using dataset path from checkpoint: {dataset_path}") + else: + print("❌ Dataset path is required when resuming") + return 1 + else: + dataset_path = Path(args.dataset_path) + else: + # New evaluation + if not args.dataset_path: + print("❌ Dataset path is required for new batch evaluation.") + return 1 + dataset_path = Path(args.dataset_path) + if not dataset_path.exists(): + print(f"❌ Dataset path not found: {dataset_path}") + return 1 + + try: + # Import batch evaluation functions + from scripts.batch_evaluate import batch_evaluate, batch_evaluate_cross + + # Check if we need to run all variants + if args.variant == "all" and not args.resume: + # All available variants + all_variants = ["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"] + + print(f"🔄 Running all {len(all_variants)} variants sequentially...") + + overall_results = [] + for i, variant in enumerate(all_variants, 1): + print(f"\n{'='*60}") + print(f"📍 Variant {i}/{len(all_variants)}: {variant}") + print(f"{'='*60}") + + # Determine output file for this variant + if args.output: + # If output specified, append variant name + output_path = Path(args.output) + output_file = output_path.parent / f"{output_path.stem}_{variant}{output_path.suffix}" + else: + output_file = None + + # Run batch evaluation for this variant + if hasattr(args, 'solver_provider') and args.solver_provider: + 
# Cross-provider batch evaluation + results = await batch_evaluate_cross( + dataset_path=dataset_path, + solver_provider=args.solver_provider, + grader_provider=args.grader_provider if hasattr(args, 'grader_provider') else args.solver_provider, + variant_type=variant, + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=output_file, + resume_checkpoint=args.resume, + vllm_url=args.vllm_url if hasattr(args, 'vllm_url') else None, + device=args.device if hasattr(args, 'device') else None, + quick=args.quick if hasattr(args, 'quick') else False + ) + else: + # Standard batch evaluation + loader_kwargs = {} + provider = args.provider or "openai" + if provider == 'vllm' and hasattr(args, 'vllm_url'): + loader_kwargs['base_url'] = args.vllm_url + elif provider == 'huggingface' and hasattr(args, 'device'): + loader_kwargs['device'] = args.device + + # Add quick mode if specified + if hasattr(args, 'quick') and args.quick: + loader_kwargs['quick'] = True + + results = await batch_evaluate( + dataset_path=dataset_path, + provider=provider, + variant_type=variant, + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=output_file, + resume_checkpoint=args.resume, + **loader_kwargs + ) + + print(f"✅ {variant} completed!") + print(f"📊 Average grade: {results['summary']['average_grade']:.2f}") + print(f"📈 Success rate: {results['summary']['success_rate']:.1f}%") + + overall_results.append({ + 'variant': variant, + 'summary': results['summary'] + }) + + # Wait between variants to ensure clean state + if i < len(all_variants): + print("\n⏳ Waiting 5 seconds before next variant...") + await asyncio.sleep(5) + + # Print overall summary + print(f"\n{'='*60}") + print("📊 OVERALL SUMMARY") + print(f"{'='*60}") + + for result in overall_results: + variant = result['variant'] + summary = 
result['summary'] + print(f"{variant:20s}: Grade {summary['average_grade']:5.2f}, Success {summary['success_rate']:5.1f}%") + + return 0 + else: + # Single variant evaluation + if hasattr(args, 'solver_provider') and args.solver_provider: + # Cross-provider batch evaluation + results = await batch_evaluate_cross( + dataset_path=dataset_path, + solver_provider=args.solver_provider, + grader_provider=args.grader_provider if hasattr(args, 'grader_provider') else args.solver_provider, + variant_type=args.variant or "original", + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=Path(args.output) if args.output else None, + resume_checkpoint=args.resume, + vllm_url=args.vllm_url if hasattr(args, 'vllm_url') else None, + device=args.device if hasattr(args, 'device') else None, + quick=args.quick if hasattr(args, 'quick') else False + ) + else: + # Standard batch evaluation + loader_kwargs = {} + provider = args.provider or "openai" + if provider == 'vllm' and hasattr(args, 'vllm_url'): + loader_kwargs['base_url'] = args.vllm_url + elif provider == 'huggingface' and hasattr(args, 'device'): + loader_kwargs['device'] = args.device + + # Add quick mode if specified + if hasattr(args, 'quick') and args.quick: + loader_kwargs['quick'] = True + + results = await batch_evaluate( + dataset_path=dataset_path, + provider=provider, + variant_type=args.variant or "original", + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=Path(args.output) if args.output else None, + resume_checkpoint=args.resume, + **loader_kwargs + ) + + print(f"✅ Batch evaluation completed!") + print(f"📊 Average grade: {results['summary']['average_grade']:.2f}") + print(f"📈 Success rate: {results['summary']['success_rate']:.1f}%") + + return 0 + + except ImportError: + print("❌ Batch evaluation module not available") + 
return 1 + except Exception as e: + print(f"❌ Batch evaluation failed: {str(e)}") + return 1 + + async def cmd_multi_test(self, args) -> int: + """Run multi-variant testing.""" + self.print_banner() + + provider = args.provider or "openai" + print(f"🎯 Multi-variant testing with {provider}") + + try: + from scripts.batch_evaluate import batch_evaluate_all_variants + + # Run multi-variant evaluation + results = await batch_evaluate_all_variants( + dataset_path=Path(args.dataset_path or "dataset"), + provider=provider, + variants=args.variants, + max_concurrent=args.concurrent or 3, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_dir=Path(args.output_dir or "multi_variant_results"), + base_url=args.vllm_url if provider == 'vllm' else None, + device=args.device if provider == 'huggingface' else None + ) + + print(f"✅ Multi-variant testing completed!") + metrics = results['aggregate_metrics'] + print(f"📊 Overall average grade: {metrics['overall_average_grade']:.2f}") + print(f"📈 Overall success rate: {metrics['overall_success_rate']:.1f}%") + print(f"⏱️ Total time: {results['test_overview']['total_test_time_minutes']:.1f} minutes") + + comparison = results['variant_comparison'] + if comparison['best_performing_variant']['variant']: + print(f"🏆 Best variant: {comparison['best_performing_variant']['variant']} " + f"(Grade: {comparison['best_performing_variant']['grade']:.2f})") + + return 0 + + except ImportError: + print("❌ Multi-variant testing module not available") + return 1 + except Exception as e: + print(f"❌ Multi-variant testing failed: {str(e)}") + return 1 + + async def cmd_info(self, args) -> int: + """Show system information.""" + self.print_banner() + + print("ℹ️ System Information") + print("-" * 30) + + # Check environment variables + print("🔧 Environment Variables:") + env_vars = [ + 'OPENAI_API_KEY', + 'ANTHROPIC_API_KEY', + 'GOOGLE_API_KEY', + 'XAI_API_KEY', + 'MOONSHOT_API_KEY' + ] + for var in 
env_vars: + value = os.getenv(var) + status = "✅ Set" if value else "❌ Not set" + provider = var.replace('_API_KEY', '').replace('MOONSHOT', 'KIMI') + print(f" {provider}: {status}") + + print() + self.print_providers() + + # Show usage examples + print("💡 Quick Start Examples:") + print(" # Single provider:") + print(" putnam solve dataset/1938-A-1.json") + print(" putnam test --provider openai") + print(" putnam batch dataset/ --provider anthropic --max-files 5") + print("") + print(" # Cross-provider:") + print(" putnam solve dataset/1938-A-1.json --solver-provider kimi --grader-provider openai") + print(" putnam batch dataset/ --solver-provider kimi --grader-provider openai --concurrent 200") + print("") + print(" # Full test with all variants:") + print(" putnam batch dataset/ --variant all --solver-provider kimi --grader-provider openai") + print("") + print(" # Resume functionality:") + print(" putnam batch --resume checkpoint_file.json") + print(" putnam batch dataset/ --provider openai --resume old_checkpoint_file.json") + + return 0 + + +def create_parser(): + """Create argument parser.""" + parser = argparse.ArgumentParser( + description="Putnam Mathematical Problem Solver CLI", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + Single provider: + putnam solve problem.json --provider openai + putnam test --provider anthropic + putnam batch dataset/ --provider gemini --max-files 10 + + Cross-provider: + putnam solve problem.json --solver-provider kimi --grader-provider openai + putnam batch dataset/ --solver-provider kimi --grader-provider openai --concurrent 200 + putnam batch dataset/ --variant all --solver-provider kimi --grader-provider openai + + Resume functionality: + putnam batch --resume checkpoint_file.json + putnam batch dataset/ --provider openai --resume old_checkpoint_file.json + """ + ) + + # Global options + parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output") + 
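The API-key check in `cmd_info` boils down to mapping env-var names to provider labels and testing presence. A sketch that takes the environment as a parameter so it stays testable; the MOONSHOT → KIMI rename mirrors the code above:

```python
def key_status(env: dict) -> dict:
    """Map provider label -> True if its API key is set and non-empty."""
    env_vars = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY",
                "XAI_API_KEY", "MOONSHOT_API_KEY"]
    status = {}
    for var in env_vars:
        # e.g. "MOONSHOT_API_KEY" -> "KIMI", "XAI_API_KEY" -> "XAI"
        provider = var.replace("_API_KEY", "").replace("MOONSHOT", "KIMI")
        status[provider] = bool(env.get(var))
    return status

# An empty string counts as "not set", matching the truthiness check above
status = key_status({"OPENAI_API_KEY": "sk-placeholder", "MOONSHOT_API_KEY": ""})
```

In the real command the dict would come from `os.environ`; passing it in keeps the helper pure.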
parser.add_argument("--debug", action="store_true", help="Enable debug mode (show JSON parsing details)") + + # Subcommands + subparsers = parser.add_subparsers(dest="command", help="Available commands") + + # Solve command + solve_parser = subparsers.add_parser("solve", help="Solve a single problem") + solve_parser.add_argument("problem_file", type=Path, help="Problem JSON file") + solve_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + solve_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + solve_parser.add_argument("--grader-provider", choices=get_supported_providers(), + + help="Provider for grading") + solve_parser.add_argument("--variant", choices=["original", "descriptive_long", "kernel_variant"], + help="Problem variant") + solve_parser.add_argument("--solver-model", help="Override solver model") + solve_parser.add_argument("--grader-model", help="Override grader model") + solve_parser.add_argument("--output", "-o", type=Path, help="Save results to file") + solve_parser.add_argument("--debug", action="store_true", help="Enable debug mode (show JSON parsing details)") + solve_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", + help="VLLM server URL") + solve_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], + help="Device for HuggingFace") + + # Test command + test_parser = subparsers.add_parser("test", help="Quick test of a provider") + test_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + test_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + test_parser.add_argument("--grader-provider", choices=get_supported_providers(), + help="Provider for grading") + test_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", help="VLLM server URL") + 
test_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], help="Device for HuggingFace") + + # Health command + health_parser = subparsers.add_parser("health", help="Check provider health") + health_parser.add_argument("--provider", choices=get_supported_providers(), + help="Check specific provider only") + health_parser.add_argument("--detailed", action="store_true", help="Detailed health check") + + # Benchmark command + benchmark_parser = subparsers.add_parser("benchmark", help="Run performance benchmark") + benchmark_parser.add_argument("--quick", action="store_true", help="Quick benchmark") + benchmark_parser.add_argument("--config", type=Path, help="Configuration file") + + # Batch command + batch_parser = subparsers.add_parser("batch", help="Batch evaluation") + batch_parser.add_argument("dataset_path", type=Path, nargs='?', help="Dataset directory (required for new runs, optional for resume)") + batch_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + batch_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + batch_parser.add_argument("--grader-provider", choices=get_supported_providers(), + help="Provider for grading") + batch_parser.add_argument("--variant", choices=["all", "original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Problem variant (use 'all' to run all variants sequentially)") + batch_parser.add_argument("--max-files", type=int, help="Maximum files to process") + batch_parser.add_argument("--concurrent", type=int, default=3, help="Concurrent evaluations") + batch_parser.add_argument("--solver-model", help="Override solver model") + batch_parser.add_argument("--grader-model", help="Override grader model") + batch_parser.add_argument("--output", "-o", help="Output file") + batch_parser.add_argument("--resume", type=Path, 
help="Resume from checkpoint file") + batch_parser.add_argument("--debug", action="store_true", help="Enable debug mode (show JSON parsing details)") + batch_parser.add_argument("--quick", action="store_true", help="Quick mode: allows one retry with 1200s timeout per attempt") + batch_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", help="VLLM server URL") + batch_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], help="Device for HuggingFace") + + # Multi-test command + multi_parser = subparsers.add_parser("multi-test", help="Run multi-variant testing") + multi_parser.add_argument("--provider", choices=get_supported_providers(), + help="AI provider (sets both solver and grader)") + multi_parser.add_argument("--solver-provider", choices=get_supported_providers(), + help="Provider for solving") + multi_parser.add_argument("--grader-provider", choices=get_supported_providers(), + help="Provider for grading") + multi_parser.add_argument("--dataset-path", type=Path, help="Dataset directory path") + multi_parser.add_argument("--variants", nargs="+", + choices=["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Specific variants to test (default: all)") + multi_parser.add_argument("--max-files", type=int, help="Maximum files per variant") + multi_parser.add_argument("--concurrent", type=int, help="Maximum concurrent evaluations") + multi_parser.add_argument("--solver-model", help="Override solver model") + multi_parser.add_argument("--grader-model", help="Override grader model") + multi_parser.add_argument("--output-dir", type=Path, help="Output directory") + multi_parser.add_argument("--vllm-url", default="http://localhost:8000/v1", help="VLLM server URL") + multi_parser.add_argument("--device", choices=["auto", "cuda", "cpu"], help="Device for HuggingFace") + + # Info command + info_parser = subparsers.add_parser("info", help="Show system 
information") + + return parser + + +async def main(): + """Main CLI entry point.""" + parser = create_parser() + args = parser.parse_args() + + # Handle no command + if not args.command: + parser.print_help() + return 1 + + # Create CLI instance + cli = PutnamCLI() + cli.verbose = args.verbose + + # Route to appropriate command + try: + if args.command == "solve": + return await cli.cmd_solve(args) + elif args.command == "test": + return await cli.cmd_test(args) + elif args.command == "health": + return await cli.cmd_health(args) + elif args.command == "benchmark": + return await cli.cmd_benchmark(args) + elif args.command == "batch": + return await cli.cmd_batch(args) + elif args.command == "multi-test": + return await cli.cmd_multi_test(args) + elif args.command == "info": + return await cli.cmd_info(args) + else: + print(f"❌ Unknown command: {args.command}") + return 1 + + except KeyboardInterrupt: + print("\n⏸️ Operation interrupted by user") + return 1 + except Exception as e: + print(f"❌ Error: {str(e)}") + if cli.verbose: + import traceback + traceback.print_exc() + return 1 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
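The `--resume` examples in the epilog above rely on checkpoint files written during batch runs. The pattern used by `batch_evaluate.py` is: record finished problem indices, rewrite the checkpoint through a temp file plus atomic rename after every problem, and on resume skip anything already recorded. A minimal standalone sketch of that pattern (function names and schema here are simplified and illustrative; the real checkpoint also stores results, failure lists, and config):

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, completed: set, results: list) -> None:
    """Write the checkpoint atomically: dump to a temp file, then rename over the target."""
    data = {"completed_indices": sorted(completed), "results": results}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename is atomic on POSIX, so readers never see a half-written file


def resume(path: Path) -> set:
    """Return the set of problem indices already finished, or empty if no checkpoint yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8"))["completed_indices"])


# Usage: skip anything already recorded, checkpoint after each new result.
ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"
done = resume(ckpt)               # empty on the first run
for idx in ["1938-A-1", "1938-A-2"]:
    if idx in done:
        continue                  # the real evaluator skips completed problems the same way
    done.add(idx)
    save_checkpoint(ckpt, done, results=[{"index": i} for i in sorted(done)])
```

Because the rename is the commit point, killing the process mid-run (or a `KeyboardInterrupt`) leaves either the previous checkpoint or the new one on disk, never a truncated JSON file.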
\ No newline at end of file diff --git a/putnam-bench-anon/requirements-local.txt b/putnam-bench-anon/requirements-local.txt new file mode 100644 index 0000000..fbcab3f --- /dev/null +++ b/putnam-bench-anon/requirements-local.txt @@ -0,0 +1,27 @@ +# Requirements for local model support (GPU recommended) +# Base requirements +-r requirements-minimal.txt + +# PyTorch with CUDA support (install separately with specific CUDA version) +# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 + +# HuggingFace transformers ecosystem +transformers>=4.30.0 +accelerate>=0.20.0 +tokenizers>=0.13.0 + +# Optional: VLLM for high-performance inference +vllm>=0.3.0 + +# GPU monitoring and management +nvidia-ml-py3>=11.0.0 + +# Model quantization and optimization +bitsandbytes>=0.39.0 + +# Additional utilities for local models +safetensors>=0.3.0 +huggingface-hub>=0.16.0 + +# Progress bars +tqdm>=4.60.0
\ No newline at end of file diff --git a/putnam-bench-anon/requirements-minimal.txt b/putnam-bench-anon/requirements-minimal.txt new file mode 100644 index 0000000..84b34a9 --- /dev/null +++ b/putnam-bench-anon/requirements-minimal.txt @@ -0,0 +1,16 @@ +# Minimal requirements for basic functionality +# Core AI provider clients +openai>=1.0.0 +anthropic>=0.25.0 +google-generativeai>=0.3.0 + +# System utilities +psutil>=5.9.0 +six>=1.16.0 + +# Basic async support +aiohttp>=3.8.0 +requests>=2.25.0 + +# Progress bars +tqdm>=4.60.0
\ No newline at end of file diff --git a/putnam-bench-anon/requirements.txt b/putnam-bench-anon/requirements.txt new file mode 100644 index 0000000..d36dc4a --- /dev/null +++ b/putnam-bench-anon/requirements.txt @@ -0,0 +1,35 @@ +# Core AI provider clients +openai>=1.0.0 +anthropic>=0.25.0 +google-generativeai>=0.3.0 + +# Local model support +transformers>=4.30.0 +torch>=2.0.0 +accelerate>=0.20.0 + +# Optional: VLLM for local model serving +vllm>=0.3.0 + +# System utilities +psutil>=5.9.0 +six>=1.16.0 + +# Development and utility libraries +asyncio-throttle>=1.0.0 +aiohttp>=3.8.0 +requests>=2.25.0 + +# Data handling +numpy>=1.21.0 +pandas>=1.3.0 + +# JSON and data processing +jsonschema>=4.0.0 + +# Math and scientific computing +sympy>=1.9.0 +matplotlib>=3.5.0 + +# Progress bars and UI +tqdm>=4.60.0
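The three requirements files above are layered: `requirements-minimal.txt` is the API-only base, `requirements-local.txt` includes it via `-r` and adds the local-model stack, and `requirements.txt` is the full set. A typical setup sketch for local GPU inference, using the CUDA index URL suggested in the comment inside `requirements-local.txt` (paths are relative to `putnam-bench-anon/`):

```shell
# API-only usage (OpenAI / Anthropic / Google clients):
pip install -r requirements-minimal.txt

# Local GPU models: install the CUDA build of PyTorch first,
# then layer the transformers / vLLM extras on top.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements-local.txt
```

The pins are lower bounds only, so installing into a fresh virtual environment is advisable to avoid conflicts with existing torch or transformers versions.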
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/__init__.py b/putnam-bench-anon/scripts/__init__.py new file mode 100644 index 0000000..389f811 --- /dev/null +++ b/putnam-bench-anon/scripts/__init__.py @@ -0,0 +1 @@ +"""Scripts package for Putnam mathematical problem solver."""
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/batch_evaluate.py b/putnam-bench-anon/scripts/batch_evaluate.py new file mode 100644 index 0000000..6fde90b --- /dev/null +++ b/putnam-bench-anon/scripts/batch_evaluate.py @@ -0,0 +1,1211 @@ +#!/usr/bin/env python3 +""" +Batch evaluation script for processing entire datasets with multiple providers. + +This script efficiently processes all JSON files in the dataset directory, +supports multiple AI providers, and generates comprehensive evaluation reports. + +Features: +- Incremental saving: Results are saved after each problem completes +- Simple resume support: Skip already completed problems based on checkpoint +- Multi-provider support +- Comprehensive evaluation reports + +Usage: + python batch_evaluate.py --provider openai --output results/openai_results.json + python batch_evaluate.py --provider anthropic --variant kernel_variant --max-concurrent 5 + +Resume usage (simplified): + # Resume with same configuration + python batch_evaluate.py --provider openai --dataset dataset/ --resume checkpoint_file.json + + # Resume with different settings (checkpoint only provides skip list) + python batch_evaluate.py --provider openai --dataset dataset/ --concurrent 10 --resume checkpoint_file.json +""" + +import asyncio +import json +import sys +import time +from pathlib import Path +import argparse +from typing import List, Dict, Any +import logging +from datetime import datetime +import shutil + +try: + from tqdm import tqdm + HAS_TQDM = True +except ImportError: + HAS_TQDM = False + # Fallback progress bar + class tqdm: + def __init__(self, total=None, desc=None, **kwargs): + self.total = total + self.n = 0 + self.desc = desc + print(f"{desc}: Starting...") + + def update(self, n=1): + self.n += n + if self.total: + percent = (self.n / self.total) * 100 + print(f"{self.desc}: {self.n}/{self.total} ({percent:.1f}%)", end='\r') + + def set_postfix(self, postfix_dict): + pass + + def close(self): + print() 
# New line after progress + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import create_loader, get_supported_providers + + +def setup_logging(output_dir: Path): + """Setup logging configuration.""" + log_file = output_dir / f"evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log" + + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler(log_file), + logging.StreamHandler(sys.stdout) + ] + ) + + return logging.getLogger(__name__) + + +async def load_dataset(dataset_path: Path, max_files: int = None) -> List[Dict[str, Any]]: + """Load all JSON files from the dataset directory.""" + json_files = list(dataset_path.glob("*.json")) + + if max_files: + json_files = json_files[:max_files] + + problems = [] + for json_file in json_files: + try: + with open(json_file, 'r', encoding='utf-8') as f: + data = json.load(f) + data['_source_file'] = str(json_file.name) + problems.append(data) + except Exception as e: + logging.warning(f"Failed to load {json_file}: {str(e)}") + + return problems + + +async def process_single_problem(loader, problem_data: Dict[str, Any], + variant_type: str, solver_model: str = None, + grader_model: str = None) -> Dict[str, Any]: + """Process a single problem and return results with metadata.""" + start_time = time.time() + + try: + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type, + solver_model=solver_model, + grader_model=grader_model + ) + + # Add metadata + result['_metadata'] = { + 'source_file': problem_data.get('_source_file', 'unknown'), + 'variant_type': variant_type, + 'processing_time': time.time() - start_time, + 'timestamp': datetime.now().isoformat(), + 'models_used': { + 'solver': solver_model or loader.solver_model, + 'grader': grader_model or loader.grader_model + } + } + + return result + + except Exception as e: + # Return error information + return { + 'error': 
str(e), + 'final_grade': 0, + '_metadata': { + 'source_file': problem_data.get('_source_file', 'unknown'), + 'variant_type': variant_type, + 'processing_time': time.time() - start_time, + 'timestamp': datetime.now().isoformat(), + 'error': True + } + } + + +async def batch_evaluate(dataset_path: Path = None, provider: str = None, variant_type: str = "original", + max_concurrent: int = 3, max_files: int = None, + solver_model: str = None, grader_model: str = None, + output_file: Path = None, resume_checkpoint: Path = None, + **loader_kwargs) -> Dict[str, Any]: + """ + Batch evaluate problems using specified provider with resume support. + + Args: + dataset_path: Path to dataset directory (required for both new runs and resume) + provider: AI provider name (required for both new runs and resume) + variant_type: Problem variant to use + max_concurrent: Maximum concurrent evaluations + max_files: Maximum number of files to process (None for all) + solver_model: Override solver model + grader_model: Override grader model + output_file: Output file path + resume_checkpoint: Path to checkpoint file to resume from + **loader_kwargs: Additional arguments for loader + + Returns: + Dictionary with evaluation results and statistics + """ + logger = logging.getLogger(__name__) + + # Check if resuming from checkpoint + if resume_checkpoint and resume_checkpoint.exists(): + logger.info(f"Resuming from checkpoint: {resume_checkpoint}") + with open(resume_checkpoint, 'r', encoding='utf-8') as f: + checkpoint_data = json.load(f) + + # Simple resume: just restore completed indices and results + completed_indices = set(checkpoint_data.get('completed_indices', [])) + results = checkpoint_data.get('results', []) + failed_indices = checkpoint_data.get('failed_indices', []) + successful_indices = checkpoint_data.get('successful_indices', []) + correct_indices = checkpoint_data.get('correct_indices', []) + + # Always require dataset_path and provider from command line + 
if not dataset_path: + raise ValueError("dataset_path is required when resuming") + if not provider: + raise ValueError("provider is required when resuming") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + checkpoint_file = resume_checkpoint # Continue using the same checkpoint file + logger.info(f"Resuming with {len(completed_indices)} completed problems out of {len(problems)}") + else: + # New evaluation - validate required parameters + if not dataset_path: + raise ValueError("dataset_path is required for new evaluation") + if not provider: + raise ValueError("provider is required for new evaluation") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + # Initialize state for new run + completed_indices = set() + results = [] + failed_indices = [] + successful_indices = [] + correct_indices = [] + + # Create checkpoint file name + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + if output_file: + checkpoint_file = output_file.parent / f"checkpoint_{output_file.stem}_{timestamp}.json" + else: + checkpoint_file = Path(f"checkpoint_{provider}_{variant_type}_{timestamp}.json") + + # Create loader + logger.info(f"Creating {provider} loader") + + # Include solver_model and grader_model in loader_kwargs if specified + if solver_model: + loader_kwargs['solver_model'] = solver_model + if grader_model: + loader_kwargs['grader_model'] = grader_model + + loader = create_loader(provider, **loader_kwargs) + + # Health check + logger.info("Performing health check...") + if not await loader.health_check(): + raise RuntimeError(f"Health check failed for {provider}") + + 
# Cost estimation + logger.info("Estimating costs...") + cost_info = await loader.estimate_cost(len(problems)) + logger.info(f"Estimated cost: ${cost_info.get('total_cost', 0):.2f}") + + # Progress tracking + remaining_problems = [p for p in problems if p.get('index', 'unknown') not in completed_indices] + progress_bar = tqdm(total=len(problems), desc=f"Evaluating with {provider}", initial=len(completed_indices)) + + # Semaphore for concurrency control + semaphore = asyncio.Semaphore(max_concurrent) + + def save_checkpoint(): + """Save current state to checkpoint file - simplified version""" + checkpoint_data = { + 'timestamp': datetime.now().isoformat(), + # Only save essential state information + 'completed_indices': list(completed_indices), + 'successful_indices': successful_indices, + 'failed_indices': failed_indices, + 'correct_indices': correct_indices, + 'results': results, + # Save minimal config for reference (not for resume) + 'dataset_path': str(dataset_path), # For convenience + 'total_problems': len(problems), + 'current_config': { + 'provider': provider, + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model + } + } + + # Write to temporary file first, then move (atomic operation) + temp_file = checkpoint_file.with_suffix('.tmp') + with open(temp_file, 'w', encoding='utf-8') as f: + json.dump(checkpoint_data, f, indent=2, ensure_ascii=False) + + # Atomic rename + temp_file.replace(checkpoint_file) + + async def evaluate_problem(problem_data: Dict[str, Any]) -> Dict[str, Any]: + """Evaluate a single problem with concurrency control.""" + problem_index = problem_data.get('index', 'unknown') + + # Skip if already completed + if problem_index in completed_indices: + return None + + async with semaphore: + try: + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type + ) + + # Track success/failure based on technical completion, not correctness + if result.get('status') 
== 'completed': + successful_indices.append(result['index']) # Successfully processed + if result.get('correct'): + correct_indices.append(result['index']) # Also correct + else: + failed_indices.append(result['index']) # Technical failure + + # Add to results and mark as completed + results.append(result) + completed_indices.add(problem_index) + + # Save checkpoint immediately after each problem + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + except Exception as e: + logger.error(f"Error evaluating problem {problem_index}: {e}") + result = { + 'index': problem_index, + 'status': 'error', + 'error': str(e), + 'error_type': type(e).__name__ + } + + # Add to results and mark as completed (even if failed) + results.append(result) + failed_indices.append(problem_index) + completed_indices.add(problem_index) + + # Save checkpoint + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + # Run evaluations + start_time = time.time() + + try: + # Create tasks only for remaining problems + tasks = [evaluate_problem(problem) for problem in remaining_problems] + + if tasks: + # Execute all tasks concurrently (limited by semaphore) + await asyncio.gather(*tasks) + else: + logger.info("All problems already completed!") + + except KeyboardInterrupt: + logger.info("Evaluation interrupted by user. 
Progress saved to checkpoint.") + logger.info(f"To resume, use: --resume {checkpoint_file}") + raise + + finally: + progress_bar.close() + + # Calculate statistics + total_time = time.time() - start_time + completed_results = [r for r in results if r.get('status') == 'completed'] + grades = [r['grade']['grade'] for r in completed_results + if r.get('grade', {}).get('status') == 'success' and 'grade' in r.get('grade', {})] + + # Calculate numeric grades (CORRECT=5, INCORRECT=2.5) + numeric_grades = [5.0 if g == 'CORRECT' else 2.5 for g in grades] + average_grade = sum(numeric_grades) / len(numeric_grades) if numeric_grades else 0.0 + + summary = { + 'total_problems': len(problems), + 'completed': len(completed_results), + 'successful': len(successful_indices), # Technical success (completed processing) + 'failed': len(failed_indices), # Technical failures + 'correct_answers': len(correct_indices), # Mathematically correct answers + 'incorrect_answers': len(successful_indices) - len(correct_indices), # Wrong but processed + 'success_rate': (len(successful_indices) / len(problems) * 100) if problems else 0, # Technical success rate + 'accuracy_rate': (len(correct_indices) / len(successful_indices) * 100) if successful_indices else 0, # Correctness rate + 'average_grade': average_grade, + 'total_time_seconds': total_time, + 'problems_per_second': len(problems) / total_time if total_time > 0 else 0, + 'provider': provider, + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model, + 'max_concurrent': max_concurrent, + 'estimated_cost': cost_info, + 'checkpoint_file': str(checkpoint_file) + } + + # Create full results + full_results = { + 'summary': summary, + 'problems': results, + 'successful_indices': successful_indices, # Technical successes + 'failed_indices': failed_indices, # Technical failures + 'correct_indices': correct_indices, # Correct answers + 'timestamp': datetime.now().isoformat() + } + + # Save final 
results + if output_file: + logger.info(f"Saving final results to {output_file}") + with open(output_file, 'w', encoding='utf-8') as f: + json.dump(full_results, f, indent=2, ensure_ascii=False) + + # Clean up checkpoint file after successful completion + if checkpoint_file.exists(): + logger.info(f"Removing checkpoint file: {checkpoint_file}") + checkpoint_file.unlink() + + # Print summary + logger.info(f"\n{'='*60}") + logger.info("EVALUATION SUMMARY") + logger.info(f"{'='*60}") + logger.info(f"Provider: {provider}") + logger.info(f"Variant: {variant_type}") + logger.info(f"Total problems: {summary['total_problems']}") + logger.info(f"✅ Successfully processed: {summary['successful']} ({summary['success_rate']:.1f}%)") + logger.info(f"💥 Technical failures: {summary['failed']}") + logger.info(f"🎯 Correct answers: {summary['correct_answers']} ({summary['accuracy_rate']:.1f}% of processed)") + logger.info(f"❌ Wrong answers: {summary['incorrect_answers']}") + logger.info(f"Average grade: {summary['average_grade']:.2f}") + logger.info(f"Total time: {summary['total_time_seconds']:.1f}s") + logger.info(f"Speed: {summary['problems_per_second']:.2f} problems/second") + + # Cleanup + if hasattr(loader, '__aexit__'): + await loader.__aexit__(None, None, None) + + return full_results + + +async def batch_evaluate_cross(dataset_path: Path = None, + solver_provider: str = None, + grader_provider: str = None, + variant_type: str = "original", + max_concurrent: int = 3, + max_files: int = None, + solver_model: str = None, + grader_model: str = None, + output_file: Path = None, + resume_checkpoint: Path = None, + vllm_url: str = None, + device: str = None, + quick: bool = False) -> Dict[str, Any]: + """ + Batch evaluate problems using different providers for solving and grading with resume support. 
+ + Args: + dataset_path: Path to dataset directory (required for both new runs and resume) + solver_provider: Provider for solving problems (required for both new runs and resume) + grader_provider: Provider for grading (if None, uses solver_provider) + variant_type: Problem variant to use + max_concurrent: Maximum concurrent evaluations + max_files: Maximum number of files to process (None for all) + solver_model: Override solver model + grader_model: Override grader model + output_file: Output file path + resume_checkpoint: Path to checkpoint file to resume from + vllm_url: VLLM server URL if using VLLM + device: Device for HuggingFace models + + Returns: + Dictionary with evaluation results and statistics + """ + logger = logging.getLogger(__name__) + + # Check if resuming from checkpoint + if resume_checkpoint and resume_checkpoint.exists(): + logger.info(f"Resuming from checkpoint: {resume_checkpoint}") + with open(resume_checkpoint, 'r', encoding='utf-8') as f: + checkpoint_data = json.load(f) + + # Simple resume: just restore completed indices and results + completed_indices = set(checkpoint_data.get('completed_indices', [])) + results = checkpoint_data.get('results', []) + failed_indices = checkpoint_data.get('failed_indices', []) + successful_indices = checkpoint_data.get('successful_indices', []) + correct_indices = checkpoint_data.get('correct_indices', []) + + # Always require providers and dataset_path from command line + if not dataset_path: + raise ValueError("dataset_path is required when resuming") + if not solver_provider: + raise ValueError("solver_provider is required when resuming") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + checkpoint_file = resume_checkpoint # Continue using the same checkpoint file + logger.info(f"Resuming 
with {len(completed_indices)} completed problems out of {len(problems)}") + else: + # New evaluation - validate required parameters + if not dataset_path: + raise ValueError("dataset_path is required for new evaluation") + if not solver_provider: + raise ValueError("solver_provider is required for new evaluation") + + # Load dataset + logger.info(f"Loading dataset from {dataset_path}") + problems = await load_dataset(dataset_path, max_files) + logger.info(f"Loaded {len(problems)} problems") + + if not problems: + raise ValueError("No problems found in dataset") + + # Initialize state for new run + completed_indices = set() + results = [] + failed_indices = [] + successful_indices = [] + correct_indices = [] + + # Create checkpoint file name + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + if output_file: + checkpoint_file = output_file.parent / f"checkpoint_{output_file.stem}_{timestamp}.json" + else: + checkpoint_file = Path(f"checkpoint_cross_{solver_provider}_{grader_provider or solver_provider}_{variant_type}_{timestamp}.json") + + # Create cross-provider loader + logger.info(f"Creating cross-provider loader: solver={solver_provider}, grader={grader_provider or solver_provider}") + + from loader import create_cross_provider_loader + + # Prepare kwargs for each provider + loader_kwargs = {} + + # VLLM settings + if vllm_url: + if solver_provider == 'vllm': + loader_kwargs['solver_kwargs'] = {'base_url': vllm_url} + if grader_provider == 'vllm': + loader_kwargs['grader_kwargs'] = {'base_url': vllm_url} + + # HuggingFace settings + if device: + if solver_provider == 'huggingface': + loader_kwargs['solver_kwargs'] = {'device': device} + if grader_provider == 'huggingface': + loader_kwargs['grader_kwargs'] = {'device': device} + + # Add quick mode if specified + if quick: + loader_kwargs['quick'] = True + + loader = create_cross_provider_loader( + solver_provider=solver_provider, + grader_provider=grader_provider, + solver_model=solver_model, + 
grader_model=grader_model, + **loader_kwargs + ) + + # Health check + logger.info("Performing health check...") + if not await loader.health_check(): + raise RuntimeError(f"Health check failed") + + # Cost estimation + logger.info("Estimating costs...") + cost_info = await loader.estimate_cost(len(problems)) + logger.info(f"Estimated cost: ${cost_info.get('total_cost', 0):.2f}") + + # Progress tracking + remaining_problems = [p for p in problems if p.get('index', 'unknown') not in completed_indices] + progress_bar = tqdm(total=len(problems), desc=f"Evaluating (solver={solver_provider}, grader={grader_provider or solver_provider})", initial=len(completed_indices)) + + # Semaphore for concurrency control + semaphore = asyncio.Semaphore(max_concurrent) + + def save_checkpoint(): + """Save current state to checkpoint file - simplified version""" + checkpoint_data = { + 'timestamp': datetime.now().isoformat(), + # Only save essential state information + 'completed_indices': list(completed_indices), + 'successful_indices': successful_indices, + 'failed_indices': failed_indices, + 'correct_indices': correct_indices, + 'results': results, + # Save minimal config for reference (not for resume) + 'dataset_path': str(dataset_path), # For convenience + 'total_problems': len(problems), + 'current_config': { + 'solver_provider': solver_provider, + 'grader_provider': grader_provider or solver_provider, + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model + } + } + + # Write to temporary file first, then move (atomic operation) + temp_file = checkpoint_file.with_suffix('.tmp') + with open(temp_file, 'w', encoding='utf-8') as f: + json.dump(checkpoint_data, f, indent=2, ensure_ascii=False) + + # Atomic rename + temp_file.replace(checkpoint_file) + + async def evaluate_problem(problem_data: Dict[str, Any]) -> Dict[str, Any]: + """Evaluate a single problem with concurrency control.""" + problem_index = problem_data.get('index', 
'unknown') + + # Skip if already completed + if problem_index in completed_indices: + return None + + async with semaphore: + try: + result = await loader.test_single_problem( + problem_data, + variant_type=variant_type + ) + + # Track success/failure based on technical completion, not correctness + if result.get('status') == 'completed': + successful_indices.append(result['index']) # Successfully processed + if result.get('correct'): + correct_indices.append(result['index']) # Also correct + else: + failed_indices.append(result['index']) # Technical failure + + # Add to results and mark as completed + results.append(result) + completed_indices.add(problem_index) + + # Save checkpoint immediately after each problem + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + except Exception as e: + import traceback + + # Capture full error details + error_details = { + 'error_message': str(e), + 'error_type': type(e).__name__, + 'traceback': traceback.format_exc(), + 'timestamp': datetime.now().isoformat(), + 'problem_index': problem_index, + 'problem_title': problem_data.get('title', 'unknown') + } + + # Try to capture HTTP-specific details if available + if hasattr(e, 'response'): + try: + error_details['http_status'] = e.response.status_code + error_details['http_headers'] = dict(e.response.headers) + error_details['http_response_text'] = e.response.text + except: + pass + + # Try to capture request details if available + if hasattr(e, 'request'): + try: + error_details['request_method'] = e.request.method + error_details['request_url'] = e.request.url + error_details['request_headers'] = dict(e.request.headers) + # Don't log request body as it might contain sensitive info + except: + pass + + # Log detailed error + logger.error(f"DETAILED ERROR for problem {problem_index}:") + logger.error(f" Error Type: 
{error_details['error_type']}") + logger.error(f" Error Message: {error_details['error_message']}") + logger.error(f" Problem Title: {error_details['problem_title']}") + + if 'http_status' in error_details: + logger.error(f" HTTP Status: {error_details['http_status']}") + logger.error(f" HTTP Response: {error_details['http_response_text'][:500]}...") + + logger.error(f" Full Traceback:\n{error_details['traceback']}") + + # Save to detailed error log + error_log_file = output_file.parent / f"detailed_errors_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json" if output_file else Path(f"detailed_errors_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json") + + try: + # Load existing errors if file exists + if error_log_file.exists(): + with open(error_log_file, 'r') as f: + existing_errors = json.load(f) + else: + existing_errors = [] + + # Add new error + existing_errors.append(error_details) + + # Save updated errors + with open(error_log_file, 'w') as f: + json.dump(existing_errors, f, indent=2, ensure_ascii=False) + + logger.info(f"Detailed error saved to {error_log_file}") + + except Exception as save_error: + logger.error(f"Failed to save detailed error log: {save_error}") + + result = { + 'index': problem_index, + 'status': 'error', + 'error': str(e), + 'error_type': type(e).__name__, + 'error_details': error_details + } + + # Add to results and mark as completed (even if failed) + results.append(result) + failed_indices.append(problem_index) + completed_indices.add(problem_index) + + # Save checkpoint + save_checkpoint() + + progress_bar.update(1) + progress_bar.set_postfix({ + 'success': len(successful_indices), + 'failed': len(failed_indices), + 'saved': len(completed_indices) + }) + + return result + + # Run evaluations + start_time = time.time() + + try: + # Create tasks only for remaining problems + tasks = [evaluate_problem(problem) for problem in remaining_problems] + + if tasks: + # Execute all tasks concurrently (limited by semaphore) + await 
asyncio.gather(*tasks) + else: + logger.info("All problems already completed!") + + except KeyboardInterrupt: + logger.info("Evaluation interrupted by user. Progress saved to checkpoint.") + logger.info(f"To resume, use: --resume {checkpoint_file}") + raise + + finally: + progress_bar.close() + + # Calculate statistics + total_time = time.time() - start_time + completed_results = [r for r in results if r.get('status') == 'completed'] + grades = [r['grade']['grade'] for r in completed_results + if r.get('grade', {}).get('status') == 'success' and 'grade' in r.get('grade', {})] + + # Calculate numeric grades (CORRECT=5, INCORRECT=2.5) + numeric_grades = [5.0 if g == 'CORRECT' else 2.5 for g in grades] + average_grade = sum(numeric_grades) / len(numeric_grades) if numeric_grades else 0.0 + + model_info = loader.get_model_info() + + summary = { + 'total_problems': len(problems), + 'completed': len(completed_results), + 'successful': len(successful_indices), # Technical success (completed processing) + 'failed': len(failed_indices), # Technical failures + 'correct_answers': len(correct_indices), # Mathematically correct answers + 'incorrect_answers': len(successful_indices) - len(correct_indices), # Wrong but processed + 'success_rate': (len(successful_indices) / len(problems) * 100) if problems else 0, # Technical success rate + 'accuracy_rate': (len(correct_indices) / len(successful_indices) * 100) if successful_indices else 0, # Correctness rate + 'average_grade': average_grade, + 'total_time_seconds': total_time, + 'problems_per_second': len(problems) / total_time if total_time > 0 else 0, + 'solver_provider': model_info.get('solver_provider', solver_provider), + 'grader_provider': model_info.get('grader_provider', grader_provider or solver_provider), + 'variant_type': variant_type, + 'solver_model': loader.solver_model, + 'grader_model': loader.grader_model, + 'max_concurrent': max_concurrent, + 'estimated_cost': cost_info, + 'is_cross_provider': 
model_info.get('is_cross_provider', False) + } + + # Create full results + full_results = { + 'summary': summary, + 'problems': results, + 'successful_indices': successful_indices, # Technical successes + 'failed_indices': failed_indices, # Technical failures + 'correct_indices': correct_indices, # Correct answers + 'timestamp': datetime.now().isoformat() + } + + # Save if requested + if output_file: + logger.info(f"Saving results to {output_file}") + with open(output_file, 'w', encoding='utf-8') as f: + json.dump(full_results, f, indent=2, ensure_ascii=False) + + # Print summary + logger.info(f"\n{'='*60}") + logger.info("CROSS-PROVIDER EVALUATION SUMMARY") + logger.info(f"{'='*60}") + logger.info(f"Solver Provider: {summary['solver_provider']} ({loader.solver_model})") + logger.info(f"Grader Provider: {summary['grader_provider']} ({loader.grader_model})") + logger.info(f"Variant: {variant_type}") + logger.info(f"Total problems: {summary['total_problems']}") + logger.info(f"✅ Successfully processed: {summary['successful']} ({summary['success_rate']:.1f}%)") + logger.info(f"💥 Technical failures: {summary['failed']}") + logger.info(f"🎯 Correct answers: {summary['correct_answers']} ({summary['accuracy_rate']:.1f}% of processed)") + logger.info(f"❌ Wrong answers: {summary['incorrect_answers']}") + logger.info(f"Average grade: {summary['average_grade']:.2f}") + logger.info(f"Total time: {summary['total_time_seconds']:.1f}s") + logger.info(f"Speed: {summary['problems_per_second']:.2f} problems/second") + + # Cleanup + if hasattr(loader, '__aexit__'): + await loader.__aexit__(None, None, None) + + return full_results + + +async def batch_evaluate_all_variants(dataset_path: Path, provider: str, + variants: List[str] = None, + max_concurrent: int = 3, max_files: int = None, + solver_model: str = None, grader_model: str = None, + output_dir: Path = None, + base_url: str = None, device: str = None) -> Dict[str, Any]: + """ + Batch evaluate problems across all variants using 
specified provider. + + Args: + dataset_path: Path to dataset directory + provider: AI provider name + variants: List of variants to test (None for all) + max_concurrent: Maximum concurrent evaluations + max_files: Maximum number of files to process per variant (None for all) + solver_model: Override solver model + grader_model: Override grader model + output_dir: Output directory path + base_url: Base URL for the vLLM provider + device: Device for the HuggingFace provider + + Returns: + Dictionary with all variant results and comparative analysis + """ + if variants is None: + variants = ["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"] + + if output_dir is None: + output_dir = Path("results") + + logger = logging.getLogger(__name__) + + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + config_name = f"{provider}" + if solver_model: + config_name += f"_{solver_model.replace('/', '_').replace('-', '_')}" + + # Create configuration-specific output directory + config_output_dir = output_dir / f"{config_name}_{timestamp}" + config_output_dir.mkdir(parents=True, exist_ok=True) + + # Prepare loader kwargs based on provider + loader_kwargs = {} + if provider == 'vllm' and base_url: + loader_kwargs['base_url'] = base_url + elif provider == 'huggingface' and device: + loader_kwargs['device'] = device + + logger.info(f"🚀 Starting multi-variant test for {config_name}") + logger.info(f"📊 Testing {len(variants)} variants with up to {max_files or 'ALL'} files each") + + overall_start_time = time.time() + variant_results = {} + + # Create overall progress bar for variants if tqdm is available + if HAS_TQDM: + variant_progress = tqdm.tqdm(total=len(variants), desc="Variants", + unit="variant", position=1, leave=True) + + for i, variant in enumerate(variants): + logger.info(f"\n📝 [{i+1}/{len(variants)}] Testing variant: {variant}") + variant_start_time = time.time() + + # Output file for this variant + variant_output_file = 
config_output_dir / f"{variant}_{timestamp}.json" + + try: + # Run batch evaluation for this variant + result = await batch_evaluate( + dataset_path=dataset_path, + provider=provider, + variant_type=variant, + max_concurrent=max_concurrent, + max_files=max_files, + solver_model=solver_model, + grader_model=grader_model, + output_file=variant_output_file, + **loader_kwargs + ) + + variant_time = time.time() - variant_start_time + + # Extract key metrics + summary = result.get('summary', {}) + variant_results[variant] = { + 'status': 'success', + 'output_file': str(variant_output_file), + 'total_problems': summary.get('total_problems', 0), + 'successful_evaluations': summary.get('successful', 0), + 'correct_evaluations': summary.get('correct_answers', 0), + 'incorrect_evaluations': summary.get('incorrect_answers', 0), + 'failed_evaluations': summary.get('failed', 0), + 'success_rate': summary.get('success_rate', 0), + 'average_grade': summary.get('average_grade', 0), + 'total_processing_time': summary.get('total_time_seconds', 0), + 'avg_time_per_problem': (summary.get('total_time_seconds', 0) / summary['total_problems']) if summary.get('total_problems') else 0, # seconds per problem (the summary's 'problems_per_second' is a rate, not a duration) + 'variant_test_time': variant_time, + 'grade_distribution': result.get('problems', []) # Assuming 'problems' contains all results + } + + logger.info(f"✅ {variant}: " + f"Grade {summary.get('average_grade', 0):.2f}, " + f"Success {summary.get('success_rate', 0):.1f}%, " + f"Time {variant_time/60:.1f}min") + + except Exception as e: + variant_time = time.time() - variant_start_time + error_msg = str(e) + + variant_results[variant] = { + 'status': 'failed', + 'error': error_msg, + 'variant_test_time': variant_time + } + + logger.error(f"❌ {variant} failed: {error_msg}") + + # Update variant progress bar + if HAS_TQDM and 'variant_progress' in locals(): + variant_progress.update(1) + successful_variants_count = len([v for v, r in variant_results.items() if r.get('status') == 'success']) + variant_progress.set_postfix({ + 'Success': successful_variants_count, + 'Failed': 
len(variant_results) - successful_variants_count + }) + + # Close variant progress bar + if HAS_TQDM and 'variant_progress' in locals(): + variant_progress.close() + + overall_time = time.time() - overall_start_time + + # Generate comprehensive summary + successful_variants = [v for v, r in variant_results.items() if r.get('status') == 'success'] + failed_variants = [v for v, r in variant_results.items() if r.get('status') == 'failed'] + + # Calculate aggregate statistics + if successful_variants: + total_problems = sum(variant_results[v].get('total_problems', 0) for v in successful_variants) + total_successful = sum(variant_results[v].get('successful_evaluations', 0) for v in successful_variants) + total_correct = sum(variant_results[v].get('correct_evaluations', 0) for v in successful_variants) + total_incorrect = sum(variant_results[v].get('incorrect_evaluations', 0) for v in successful_variants) + total_failed = sum(variant_results[v].get('failed_evaluations', 0) for v in successful_variants) + + grades = [variant_results[v].get('average_grade', 0) for v in successful_variants] + success_rates = [variant_results[v].get('success_rate', 0) for v in successful_variants] + times = [variant_results[v].get('avg_time_per_problem', 0) for v in successful_variants] + + overall_avg_grade = sum(grades) / len(grades) if grades else 0 + overall_success_rate = sum(success_rates) / len(success_rates) if success_rates else 0 + overall_avg_time = sum(times) / len(times) if times else 0 + + # Find best and worst performing variants + best_variant = max(successful_variants, key=lambda v: variant_results[v].get('average_grade', 0)) + worst_variant = min(successful_variants, key=lambda v: variant_results[v].get('average_grade', 0)) + + fastest_variant = min(successful_variants, key=lambda v: variant_results[v].get('avg_time_per_problem', float('inf'))) + slowest_variant = max(successful_variants, key=lambda v: variant_results[v].get('avg_time_per_problem', 0)) + else: + 
total_problems = total_successful = total_correct = total_incorrect = total_failed = 0 + overall_avg_grade = overall_success_rate = overall_avg_time = 0 + best_variant = worst_variant = fastest_variant = slowest_variant = None + + summary_result = { + 'configuration': { + 'provider': provider, + 'solver_model': solver_model, + 'grader_model': grader_model, + 'base_url': base_url, + 'device': device, + 'timestamp': timestamp + }, + 'test_overview': { + 'total_variants_tested': len(variant_results), + 'successful_variants': len(successful_variants), + 'failed_variants': len(failed_variants), + 'total_test_time_minutes': overall_time / 60, + 'variants_list': list(variant_results.keys()) + }, + 'aggregate_metrics': { + 'total_problems_across_variants': total_problems, + 'total_successful_evaluations': total_successful, + 'total_correct_evaluations': total_correct, + 'total_incorrect_evaluations': total_incorrect, + 'total_technical_failures': total_failed, + 'overall_average_grade': overall_avg_grade, + 'overall_success_rate': overall_success_rate, + 'overall_avg_time_per_problem': overall_avg_time + }, + 'variant_comparison': { + 'best_performing_variant': { + 'variant': best_variant, + 'grade': variant_results.get(best_variant, {}).get('average_grade', 0) if best_variant else 0 + }, + 'worst_performing_variant': { + 'variant': worst_variant, + 'grade': variant_results.get(worst_variant, {}).get('average_grade', 0) if worst_variant else 0 + }, + 'fastest_variant': { + 'variant': fastest_variant, + 'time_per_problem': variant_results.get(fastest_variant, {}).get('avg_time_per_problem', 0) if fastest_variant else 0 + }, + 'slowest_variant': { + 'variant': slowest_variant, + 'time_per_problem': variant_results.get(slowest_variant, {}).get('avg_time_per_problem', 0) if slowest_variant else 0 + } + }, + 'detailed_variant_results': variant_results + } + + # Save configuration summary + summary_file = config_output_dir / f"SUMMARY_{config_name}_{timestamp}.json" + with 
open(summary_file, 'w', encoding='utf-8') as f: + json.dump(summary_result, f, indent=2, ensure_ascii=False) + + # Print summary to console + logger.info("\n" + "="*80) + logger.info("📊 MULTI-VARIANT TEST SUMMARY REPORT") + logger.info("="*80) + + logger.info(f"🤖 Provider: {provider}") + if solver_model: + logger.info(f"🧠 Solver Model: {solver_model}") + if grader_model: + logger.info(f"📝 Grader Model: {grader_model}") + + logger.info(f"\n📋 Test Overview:") + logger.info(f" Total variants tested: {len(variant_results)}") + logger.info(f" Successful variants: {len(successful_variants)}") + logger.info(f" Failed variants: {len(failed_variants)}") + logger.info(f" Total test time: {overall_time/60:.1f} minutes") + + if total_problems > 0: + logger.info(f"\n📈 Aggregate Performance:") + logger.info(f" Total problems: {total_problems}") + logger.info(f" Overall average grade: {overall_avg_grade:.2f}") + logger.info(f" Overall success rate: {overall_success_rate:.1f}%") + logger.info(f" Average time per problem: {overall_avg_time:.2f}s") + + if best_variant: + logger.info(f"\n🏆 Variant Performance:") + logger.info(f" Best performing: {best_variant} (Grade: {variant_results[best_variant]['average_grade']:.2f})") + logger.info(f" Worst performing: {worst_variant} (Grade: {variant_results[worst_variant]['average_grade']:.2f})") + logger.info(f" Fastest: {fastest_variant} ({variant_results[fastest_variant]['avg_time_per_problem']:.2f}s/problem)") + logger.info(f" Slowest: {slowest_variant} ({variant_results[slowest_variant]['avg_time_per_problem']:.2f}s/problem)") + + logger.info("="*80) + logger.info(f"💾 Configuration summary saved to {summary_file}") + + return summary_result + + +async def main(): + """Main function.""" + parser = argparse.ArgumentParser(description="Batch evaluate mathematical problems") + + # Required arguments + parser.add_argument("--provider", required=True, choices=get_supported_providers(), + help="AI provider to use") + + # Dataset options + 
parser.add_argument("--dataset", default="dataset", + help="Dataset directory path (default: dataset)") + parser.add_argument("--variant", default="original", + choices=["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Problem variant to use (default: original)") + parser.add_argument("--all-variants", action="store_true", + help="Test all 6 problem variants instead of just one") + parser.add_argument("--variants", nargs="+", + choices=["original", "descriptive_long", "descriptive_long_confusing", + "descriptive_long_misleading", "garbled_string", "kernel_variant"], + help="Specific variants to test (use with --all-variants)") + parser.add_argument("--max-files", type=int, + help="Maximum number of files to process per variant (default: all)") + + # Processing options + parser.add_argument("--max-concurrent", type=int, default=3, + help="Maximum concurrent evaluations (default: 3)") + parser.add_argument("--solver-model", + help="Override solver model") + parser.add_argument("--grader-model", + help="Override grader model") + + # Output options + parser.add_argument("--output", type=Path, + help="Output file path (default: results/[provider]_[timestamp].json)") + parser.add_argument("--output-dir", type=Path, default="results", + help="Output directory (default: results)") + parser.add_argument("--resume", type=Path, + help="Path to checkpoint file to resume from") + + # Provider-specific options + parser.add_argument("--base-url", + help="Base URL for VLLM provider") + parser.add_argument("--device", default="auto", + help="Device for HuggingFace provider (auto/cuda/cpu)") + + args = parser.parse_args() + + # Setup output directory and logging + args.output_dir.mkdir(parents=True, exist_ok=True) + logger = setup_logging(args.output_dir) + + # Default output file if not specified + if not args.output: + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + args.output = 
args.output_dir / f"{args.provider}_{args.variant}_{timestamp}.json" + + # Prepare loader kwargs based on provider + loader_kwargs = {} + if args.provider == 'vllm' and args.base_url: + loader_kwargs['base_url'] = args.base_url + elif args.provider == 'huggingface' and args.device: + loader_kwargs['device'] = args.device + + try: + if args.all_variants or args.variants: + # Multi-variant evaluation + variants_to_test = args.variants if args.variants else None + results = await batch_evaluate_all_variants( + dataset_path=Path(args.dataset), + provider=args.provider, + variants=variants_to_test, + max_concurrent=args.max_concurrent, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_dir=args.output_dir, + base_url=args.base_url, + device=args.device + ) + + logger.info(f"Multi-variant evaluation completed successfully!") + logger.info(f"Overall average grade: {results['aggregate_metrics']['overall_average_grade']:.2f}") + logger.info(f"Overall success rate: {results['aggregate_metrics']['overall_success_rate']:.1f}%") + else: + # Single variant evaluation + results = await batch_evaluate( + dataset_path=Path(args.dataset), + provider=args.provider, + variant_type=args.variant, + max_concurrent=args.max_concurrent, + max_files=args.max_files, + solver_model=args.solver_model, + grader_model=args.grader_model, + output_file=args.output, + resume_checkpoint=args.resume, + **loader_kwargs + ) + + logger.info(f"Batch evaluation completed successfully!") + logger.info(f"Average grade: {results['summary']['average_grade']:.2f}") + logger.info(f"Success rate: {results['summary']['success_rate']:.1f}%") + + except KeyboardInterrupt: + logger.info("Evaluation interrupted by user") + except Exception as e: + logger.error(f"Evaluation failed: {str(e)}") + return 1 + + return 0 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
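The evaluation loop above bounds concurrency by wrapping each `test_single_problem` call in `async with semaphore:` before handing all tasks to `asyncio.gather`. Stripped of checkpointing and error logging, that pattern can be sketched in isolation (the helper names here are illustrative, not from the script):

```python
import asyncio


async def bounded_gather(items, worker, max_concurrent=3):
    """Run `worker` over `items` with at most `max_concurrent` tasks in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        # Each task waits for a semaphore slot before doing work,
        # mirroring the `async with semaphore:` block in evaluate_problem.
        async with semaphore:
            return await worker(item)

    # gather preserves input order regardless of completion order,
    # so results line up with the original problem list.
    return await asyncio.gather(*(run_one(i) for i in items))


async def square(x):
    await asyncio.sleep(0)  # stand-in for an API call
    return x * x


print(asyncio.run(bounded_gather(range(5), square)))  # [0, 1, 4, 9, 16]
```

All tasks are created up front; the semaphore, not task creation, is what limits how many requests run concurrently — the same trade-off `--max-concurrent` controls in the script.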
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/benchmark.py b/putnam-bench-anon/scripts/benchmark.py new file mode 100644 index 0000000..2fed228 --- /dev/null +++ b/putnam-bench-anon/scripts/benchmark.py @@ -0,0 +1,481 @@ +#!/usr/bin/env python3 +""" +Benchmark script for comparing AI providers and models on mathematical problems. + +This script runs comparative evaluations across multiple providers, models, and +problem variants to assess performance, accuracy, cost, and speed trade-offs. + +Usage: + python benchmark.py --config benchmark_config.json + python benchmark.py --quick-test # Quick 3-problem test across all providers + python benchmark.py --providers openai anthropic --models gpt-4o-mini claude-3-5-haiku +""" + +import asyncio +import json +import sys +import time +from pathlib import Path +import argparse +from typing import List, Dict, Any, Tuple +import logging +from datetime import datetime +import itertools +import statistics + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import create_loader, get_supported_providers, get_default_models + + +class BenchmarkRunner: + """Benchmark runner for AI providers.""" + + def __init__(self, output_dir: Path = Path("benchmark_results")): + self.output_dir = output_dir + self.output_dir.mkdir(parents=True, exist_ok=True) + + # Setup logging + log_file = self.output_dir / f"benchmark_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log" + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler(log_file), + logging.StreamHandler(sys.stdout) + ] + ) + self.logger = logging.getLogger(__name__) + + async def load_test_problems(self, dataset_path: Path, max_problems: int = 10) -> List[Dict[str, Any]]: + """Load test problems from dataset.""" + json_files = list(dataset_path.glob("*.json"))[:max_problems] + + problems = [] + for json_file in json_files: + try: + with 
open(json_file, 'r', encoding='utf-8') as f: + data = json.load(f) + data['_source_file'] = str(json_file.name) + problems.append(data) + except Exception as e: + self.logger.warning(f"Failed to load {json_file}: {str(e)}") + + return problems + + async def run_single_configuration(self, + provider: str, + solver_model: str, + grader_model: str, + problems: List[Dict[str, Any]], + variant_type: str = "original", + **loader_kwargs) -> Dict[str, Any]: + """Run benchmark for a single provider/model configuration.""" + config_name = f"{provider}_{solver_model}_{grader_model}".replace("/", "_").replace("-", "_") + self.logger.info(f"🚀 Testing configuration: {config_name}") + + result = { + 'configuration': { + 'provider': provider, + 'solver_model': solver_model, + 'grader_model': grader_model, + 'variant_type': variant_type, + 'loader_kwargs': loader_kwargs + }, + 'metrics': {}, + 'problems': [], + 'errors': [] + } + + try: + # Create loader + loader = create_loader( + provider, + solver_model=solver_model, + grader_model=grader_model, + **loader_kwargs + ) + + # Health check + if not await loader.health_check(): + raise RuntimeError(f"Health check failed for {provider}") + + # Cost estimation + cost_info = await loader.estimate_cost(len(problems)) + result['metrics']['estimated_cost'] = cost_info + + # Process each problem + start_time = time.time() + grades = [] + processing_times = [] + + for i, problem in enumerate(problems): + problem_start = time.time() + + try: + problem_result = await loader.test_single_problem( + problem, + variant_type=variant_type + ) + + processing_time = time.time() - problem_start + # Convert boolean 'correct' to numeric grade (10 for correct, 0 for incorrect) + grade = 10 if problem_result.get('correct', False) else 0 + + grades.append(grade) + processing_times.append(processing_time) + + result['problems'].append({ + 'source_file': problem.get('_source_file', f'problem_{i}'), + 'grade': grade, + 'processing_time': processing_time, + 
'solution_length': len(problem_result.get('solution', '')), + 'grading_feedback_length': len(str(problem_result.get('grading_result', {}).get('feedback', ''))) + }) + + self.logger.info(f" Problem {i+1}/{len(problems)}: Grade {grade} ({processing_time:.2f}s)") + + except Exception as e: + error_info = { + 'problem_index': i, + 'source_file': problem.get('_source_file', f'problem_{i}'), + 'error': str(e), + 'processing_time': time.time() - problem_start + } + result['errors'].append(error_info) + self.logger.error(f" Problem {i+1}/{len(problems)} failed: {str(e)}") + + total_time = time.time() - start_time + + # Calculate metrics + if grades: + result['metrics'].update({ + 'total_problems': len(problems), + 'successful_problems': len(grades), + 'failed_problems': len(result['errors']), + 'success_rate': len(grades) / len(problems) * 100, + 'average_grade': statistics.mean(grades), + 'median_grade': statistics.median(grades), + 'grade_std': statistics.stdev(grades) if len(grades) > 1 else 0, + 'max_grade': max(grades), + 'min_grade': min(grades), + 'total_time': total_time, + 'average_time_per_problem': statistics.mean(processing_times), + 'median_time_per_problem': statistics.median(processing_times), + 'total_time_successful': sum(processing_times), + 'throughput_problems_per_minute': len(grades) / (total_time / 60) if total_time > 0 else 0 + }) + else: + result['metrics'].update({ + 'total_problems': len(problems), + 'successful_problems': 0, + 'failed_problems': len(result['errors']), + 'success_rate': 0, + 'total_time': total_time, + 'error_rate': 100 + }) + + self.logger.info(f"✅ Configuration completed: {result['metrics']['success_rate']:.1f}% success, " + f"avg grade: {result['metrics'].get('average_grade', 0):.2f}") + + except Exception as e: + result['metrics']['fatal_error'] = str(e) + self.logger.error(f"❌ Configuration failed: {str(e)}") + + return result + + async def run_comparative_benchmark(self, + configurations: List[Dict[str, Any]], + problems: 
List[Dict[str, Any]], + variant_type: str = "original") -> Dict[str, Any]: + """Run comparative benchmark across multiple configurations.""" + self.logger.info(f"🏁 Starting comparative benchmark with {len(configurations)} configurations") + self.logger.info(f"📊 Testing {len(problems)} problems with variant: {variant_type}") + + benchmark_start = time.time() + results = [] + + for i, config in enumerate(configurations): + self.logger.info(f"\n📋 Configuration {i+1}/{len(configurations)}") + + provider = config['provider'] + solver_model = config.get('solver_model') + grader_model = config.get('grader_model') + loader_kwargs = config.get('loader_kwargs', {}) + + # Use defaults if not specified + if not solver_model or not grader_model: + defaults = get_default_models(provider) + solver_model = solver_model or defaults['solver_model'] + grader_model = grader_model or defaults['grader_model'] + + config_result = await self.run_single_configuration( + provider=provider, + solver_model=solver_model, + grader_model=grader_model, + problems=problems, + variant_type=variant_type, + **loader_kwargs + ) + + results.append(config_result) + + total_benchmark_time = time.time() - benchmark_start + + # Generate comparison report + report = self.generate_comparison_report(results, total_benchmark_time) + + # Save detailed results + timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') + detailed_file = self.output_dir / f"benchmark_detailed_{timestamp}.json" + with open(detailed_file, 'w', encoding='utf-8') as f: + json.dump({ + 'benchmark_info': { + 'timestamp': datetime.now().isoformat(), + 'total_configurations': len(configurations), + 'total_problems': len(problems), + 'variant_type': variant_type, + 'total_time': total_benchmark_time + }, + 'configurations': configurations, + 'results': results, + 'comparison_report': report + }, f, indent=2, ensure_ascii=False) + + self.logger.info(f"💾 Detailed results saved to {detailed_file}") + + return report + + def 
generate_comparison_report(self, results: List[Dict[str, Any]], total_time: float) -> Dict[str, Any]: + """Generate comparison report from benchmark results.""" + self.logger.info("\n" + "="*60) + self.logger.info("📊 BENCHMARK COMPARISON REPORT") + self.logger.info("="*60) + + # Filter successful results + successful_results = [r for r in results if r['metrics'].get('success_rate', 0) > 0] + + if not successful_results: + self.logger.warning("⚠️ No successful configurations found!") + return {'error': 'No successful configurations'} + + # Ranking by different metrics + rankings = { + 'accuracy': sorted(successful_results, key=lambda x: x['metrics']['average_grade'], reverse=True), + 'speed': sorted(successful_results, key=lambda x: x['metrics']['average_time_per_problem']), + 'throughput': sorted(successful_results, key=lambda x: x['metrics']['throughput_problems_per_minute'], reverse=True), + 'success_rate': sorted(successful_results, key=lambda x: x['metrics']['success_rate'], reverse=True) + } + + # Print rankings + for metric, ranked_results in rankings.items(): + self.logger.info(f"\n🏆 Top 3 by {metric.upper()}:") + for i, result in enumerate(ranked_results[:3]): + config = result['configuration'] + metrics = result['metrics'] + provider = config['provider'] + solver = config['solver_model'] + + if metric == 'accuracy': + value = f"{metrics['average_grade']:.2f}" + elif metric == 'speed': + value = f"{metrics['average_time_per_problem']:.2f}s" + elif metric == 'throughput': + value = f"{metrics['throughput_problems_per_minute']:.1f} prob/min" + elif metric == 'success_rate': + value = f"{metrics['success_rate']:.1f}%" + + self.logger.info(f" {i+1}. 
{provider}/{solver}: {value}") + + # Calculate cost efficiency + cost_efficiency = [] + for result in successful_results: + metrics = result['metrics'] + cost_info = metrics.get('estimated_cost', {}) + total_cost = cost_info.get('total_cost', 0) + avg_grade = metrics.get('average_grade', 0) + + if total_cost > 0 and avg_grade > 0: + efficiency = avg_grade / total_cost # Grade per unit cost + cost_efficiency.append({ + 'result': result, + 'efficiency': efficiency, + 'cost': total_cost, + 'grade': avg_grade + }) + + if cost_efficiency: + cost_efficiency.sort(key=lambda x: x['efficiency'], reverse=True) + self.logger.info(f"\n💰 Top 3 by COST EFFICIENCY (Grade/Cost):") + for i, item in enumerate(cost_efficiency[:3]): + config = item['result']['configuration'] + provider = config['provider'] + solver = config['solver_model'] + self.logger.info(f" {i+1}. {provider}/{solver}: {item['efficiency']:.2f} " + f"(Grade: {item['grade']:.2f}, Cost: {item['cost']:.4f})") + + # Overall statistics + all_grades = [] + all_times = [] + all_success_rates = [] + + for result in successful_results: + metrics = result['metrics'] + all_grades.append(metrics['average_grade']) + all_times.append(metrics['average_time_per_problem']) + all_success_rates.append(metrics['success_rate']) + + self.logger.info(f"\n📈 OVERALL STATISTICS:") + self.logger.info(f" Configurations tested: {len(results)}") + self.logger.info(f" Successful configurations: {len(successful_results)}") + self.logger.info(f" Average grade across all: {statistics.mean(all_grades):.2f}") + self.logger.info(f" Average time per problem: {statistics.mean(all_times):.2f}s") + self.logger.info(f" Average success rate: {statistics.mean(all_success_rates):.1f}%") + self.logger.info(f" Total benchmark time: {total_time/60:.2f} minutes") + + # Generate final report + report = { + 'summary': { + 'total_configurations': len(results), + 'successful_configurations': len(successful_results), + 'overall_avg_grade': statistics.mean(all_grades) 
if all_grades else 0, + 'overall_avg_time': statistics.mean(all_times) if all_times else 0, + 'overall_avg_success_rate': statistics.mean(all_success_rates) if all_success_rates else 0, + 'total_benchmark_time': total_time + }, + 'rankings': { + metric: [ + { + 'provider': r['configuration']['provider'], + 'solver_model': r['configuration']['solver_model'], + 'grader_model': r['configuration']['grader_model'], + 'score': r['metrics'][metric_key] + } + for r in ranked[:5] # Top 5 + ] for metric, ranked in rankings.items() + for metric_key in [{'accuracy': 'average_grade', 'speed': 'average_time_per_problem', + 'throughput': 'throughput_problems_per_minute', 'success_rate': 'success_rate'}[metric]] + }, + 'cost_efficiency': [ + { + 'provider': item['result']['configuration']['provider'], + 'solver_model': item['result']['configuration']['solver_model'], + 'efficiency': item['efficiency'], + 'grade': item['grade'], + 'cost': item['cost'] + } + for item in cost_efficiency[:5] + ] if cost_efficiency else [] + } + + return report + + +async def run_quick_test(): + """Run a quick test across all providers with 3 problems.""" + runner = BenchmarkRunner() + + # Load 3 test problems + problems = await runner.load_test_problems(Path("dataset"), max_problems=3) + if not problems: + print("❌ No test problems found in dataset directory") + return + + # Default configurations for all providers + configurations = [] + for provider in get_supported_providers(): + config = {'provider': provider} + + # Provider-specific settings + if provider == 'vllm': + config['loader_kwargs'] = {'base_url': 'http://localhost:8000/v1'} + elif provider == 'huggingface': + config['loader_kwargs'] = { + 'device': 'cpu', + 'solver_model': 'microsoft/DialoGPT-small', + 'grader_model': 'microsoft/DialoGPT-small' + } + + configurations.append(config) + + # Run benchmark + await runner.run_comparative_benchmark(configurations, problems) + + +async def run_custom_benchmark(config_file: Path): + """Run 
benchmark from configuration file.""" + with open(config_file, 'r', encoding='utf-8') as f: + config = json.load(f) + + runner = BenchmarkRunner(Path(config.get('output_dir', 'benchmark_results'))) + + # Load problems + dataset_path = Path(config.get('dataset_path', 'dataset')) + max_problems = config.get('max_problems', 10) + variant_type = config.get('variant_type', 'original') + + problems = await runner.load_test_problems(dataset_path, max_problems) + if not problems: + print(f"❌ No problems found in {dataset_path}") + return + + # Load configurations + configurations = config.get('configurations', []) + if not configurations: + print("❌ No configurations specified in config file") + return + + # Run benchmark + await runner.run_comparative_benchmark(configurations, problems, variant_type) + + +async def main(): + """Main function.""" + parser = argparse.ArgumentParser(description="Benchmark AI providers on mathematical problems") + + # Benchmark modes + group = parser.add_mutually_exclusive_group(required=True) + group.add_argument("--config", type=Path, help="Configuration file path") + group.add_argument("--quick-test", action="store_true", + help="Quick test with 3 problems across all providers") + + # Custom benchmark options + parser.add_argument("--providers", nargs="+", choices=get_supported_providers(), + help="Providers to test (for custom benchmark)") + parser.add_argument("--models", nargs="+", + help="Models to test (for custom benchmark)") + parser.add_argument("--dataset", type=Path, default="dataset", + help="Dataset path (default: dataset)") + parser.add_argument("--max-problems", type=int, default=10, + help="Maximum problems to test (default: 10)") + parser.add_argument("--variant", default="original", + choices=["original", "descriptive_long", "kernel_variant"], + help="Problem variant (default: original)") + parser.add_argument("--output-dir", type=Path, default="benchmark_results", + help="Output directory (default: benchmark_results)") + 
+ args = parser.parse_args() + + try: + if args.quick_test: + await run_quick_test() + elif args.config: + await run_custom_benchmark(args.config) + else: + # Custom benchmark mode (placeholder for future implementation) + print("Custom benchmark mode not yet implemented. Use --config or --quick-test.") + return 1 + + return 0 + + except KeyboardInterrupt: + print("\n⏸️ Benchmark interrupted by user") + return 1 + except Exception as e: + print(f"\n❌ Benchmark failed: {str(e)}") + return 1 + + +if __name__ == "__main__": + exit(asyncio.run(main()))
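A configuration file consumed by `run_custom_benchmark` above can be sketched from its `config.get` calls. The top-level keys and defaults mirror the script; the shape of each `configurations` entry is an assumption, since the script only forwards them to `run_comparative_benchmark` without showing their schema:

```python
import json

# Hypothetical benchmark config; top-level keys and defaults mirror the
# config.get(...) calls in run_custom_benchmark. The per-entry fields in
# "configurations" are illustrative guesses, not a documented schema.
config = {
    "output_dir": "benchmark_results",
    "dataset_path": "dataset",
    "max_problems": 10,
    "variant_type": "original",
    "configurations": [
        {"provider": "openai", "model": "gpt-4o-mini"},
    ],
}

text = json.dumps(config, indent=2)

# run_custom_benchmark reads the file back with json.load and falls back
# to the same defaults for any missing keys:
loaded = json.loads(text)
print(loaded.get("variant_type", "original"))  # original
print(loaded.get("max_problems", 10))          # 10
```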
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/compare_original_vs_kernel_test.py b/putnam-bench-anon/scripts/compare_original_vs_kernel_test.py new file mode 100644 index 0000000..76952bd --- /dev/null +++ b/putnam-bench-anon/scripts/compare_original_vs_kernel_test.py @@ -0,0 +1,630 @@ +#!/usr/bin/env python3 +""" +Original vs kernel-variant mathematical-ability comparison test. +Solve with 4o-mini, grade strictly with o3, and compare the accuracy of the two problem forms. +""" + +import os +import json +import asyncio +import pathlib +import time +import re +import random +from typing import Dict, List, Tuple, Optional +import click +import tqdm +from openai import AsyncOpenAI, RateLimitError, APIError, APIConnectionError + +# Configuration +SOLVER_MODEL = "gpt-4o-mini" # Model used to solve problems +GRADER_MODEL = "o3" # Model used to grade solutions +SRC_DIR = pathlib.Path("raw/json") +RESULTS_DIR = pathlib.Path("results/comparison_test") +RESULTS_DIR.mkdir(parents=True, exist_ok=True) + +RETRIES = 4 +TIMEOUT_BASE = 600 +RESP_FMT = {"type": "json_object"} + +# Solver system prompt - 4o-mini +SOLVER_SYSTEM_PROMPT = """You are an expert mathematician solving competition-level problems. +Provide detailed, step-by-step solutions with clear mathematical reasoning. + +Requirements: +- Show all your work and intermediate steps +- Justify each major step of your reasoning +- Use proper mathematical notation +- Be thorough but concise +- State your final answer clearly + +Solve the problem completely and rigorously.""" + +SOLVER_USER_TEMPLATE = """Please solve this mathematical problem: + +{problem_statement} + +Provide a complete solution with detailed reasoning. Return your response in JSON format: +{{"solution": "your complete step-by-step solution with mathematical reasoning", + "final_answer": "your final answer in a clear, concise form"}}""" + +# Strict grading system prompt for proof problems - o3 +PROOF_GRADER_SYSTEM_PROMPT = """You are an extremely strict mathematical grader evaluating competition-level PROOF problems. 
+ +GRADING STANDARDS (BE VERY STRICT): +- Mathematical rigor: Every step must be mathematically sound and justified +- Logical flow: The reasoning must be clear, complete, and logically connected +- Correctness: All calculations, algebraic manipulations, and conclusions must be correct +- Completeness: The solution must address all parts of the problem fully +- Precision: Mathematical statements must be precise and unambiguous + +FAILING CRITERIA (Mark as INCORRECT if ANY of these apply): +- Any unjustified logical leap or gap in reasoning +- Any computational error, no matter how small +- Missing steps in critical parts of the argument +- Imprecise or ambiguous mathematical statements +- Incorrect final answer, even if approach is partially correct +- Circular reasoning or logical fallacies +- Misuse of mathematical theorems or definitions + +BE EXTREMELY STRICT. Competition mathematics proofs require perfect precision.""" + +# More lenient grading system prompt for calculation problems - o3 +CALCULATION_GRADER_SYSTEM_PROMPT = """You are a mathematical grader evaluating competition-level CALCULATION problems. + +GRADING STANDARDS FOR CALCULATION PROBLEMS: +- Primary focus: Is the final answer correct? +- Secondary focus: Is the overall approach reasonable and mathematically sound? +- Computation: Allow minor computational slips if the method is correct and final answer is right + +GRADING CRITERIA: +- CORRECT: Final answer is correct AND approach is fundamentally sound +- INCORRECT: Final answer is wrong OR approach is fundamentally flawed + +For calculation problems, the final numerical answer is the most important criterion. +Minor intermediate errors are acceptable if they don't affect the final result.""" + +PROOF_GRADER_USER_TEMPLATE = """Grade this PROOF solution with extreme strictness. + +PROBLEM: +{problem_statement} + +STUDENT SOLUTION: +{solution} + +CORRECT REFERENCE SOLUTION: +{reference_solution} + +Evaluate with maximum strictness. Every logical step must be perfect. 
Return JSON with: +{{"grade": "CORRECT" or "INCORRECT", + "detailed_feedback": "specific detailed analysis of what is right/wrong", + "major_issues": "list of significant mathematical errors or gaps", + "final_answer_correct": true or false, + "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed), + "overall_assessment": "comprehensive evaluation summary"}}""" + +CALCULATION_GRADER_USER_TEMPLATE = """Grade this CALCULATION solution with focus on final answer correctness. + +PROBLEM: +{problem_statement} + +STUDENT SOLUTION: +{solution} + +CORRECT REFERENCE SOLUTION: +{reference_solution} + +Focus primarily on whether the final answer is correct. Return JSON with: +{{"grade": "CORRECT" or "INCORRECT", + "detailed_feedback": "specific detailed analysis of what is right/wrong", + "major_issues": "list of significant mathematical errors or gaps", + "final_answer_correct": true or false, + "reasoning_rigor_score": 0-10 integer (10=perfect rigor, 0=severely flawed), + "overall_assessment": "comprehensive evaluation summary"}}""" + +JSON_RE = re.compile(r"\{[\s\S]*\}") + +def parse_json_response(raw: str) -> Optional[Dict]: + """Parse JSON from LLM response with fallback strategies.""" + if not raw: + return None + + try: + return json.loads(raw) + except json.JSONDecodeError: + pass + + match = JSON_RE.search(raw) + if match: + try: + return json.loads(match.group(0)) + except json.JSONDecodeError: + pass + + try: + fixed = raw.replace('\\"', '"').replace('\\\\', '\\') + return json.loads(fixed) + except json.JSONDecodeError: + pass + + return None + +def to_str(x) -> str: + """Convert various types to string safely.""" + if x is None: + return "" + if isinstance(x, str): + return x + if isinstance(x, (list, tuple)): + return "\n".join(map(str, x)) + return str(x) + +async def call_api_with_retry(cli: AsyncOpenAI, model: str, messages: List[Dict]) -> Tuple[Optional[Dict], str]: + """Make OpenAI API call with retry logic.""" + raw_response = "" + + for attempt in range(1, RETRIES + 1): + timeout = TIMEOUT_BASE * 
(2 ** (attempt - 1)) + try: + # Set temperature based on model + # o3, o3-mini, and o4-mini require temperature 1.0 + if any(model_name in model.lower() for model_name in ['o3', 'o3-mini', 'o4-mini']): + temperature = 1.0 + else: + # Use temperature 0.0 for deterministic solving with other models + temperature = 0.0 + + response = await asyncio.wait_for( + cli.chat.completions.create( + model=model, + messages=messages, + temperature=temperature, + response_format=RESP_FMT, + ), + timeout=timeout, + ) + raw_response = response.choices[0].message.content or "" + parsed = parse_json_response(raw_response) + if parsed: + return parsed, raw_response + raise ValueError("Failed to parse JSON response") + + except RateLimitError as e: + print(f"🚫 RateLimitError (attempt {attempt}/{RETRIES}): {str(e)}") + if "insufficient_quota" in str(e): + print("⏳ Detected quota exhaustion - sleeping 15 minutes") + await asyncio.sleep(900) + else: + sleep_time = 2 ** attempt + random.random() + print(f" ⏰ Rate limited, sleeping {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + except (APIError, APIConnectionError, asyncio.TimeoutError, ValueError) as e: + print(f"❌ {type(e).__name__} (attempt {attempt}/{RETRIES}): {str(e)}") + if attempt == RETRIES: + return None, raw_response + sleep_time = 2 ** attempt + random.random() + print(f" ⏰ Retrying in {sleep_time:.1f}s") + await asyncio.sleep(sleep_time) + + return None, raw_response + +async def solve_problem(cli: AsyncOpenAI, problem_statement: str) -> Tuple[Optional[Dict], str]: + """Have 4o-mini solve the problem.""" + messages = [ + {"role": "system", "content": SOLVER_SYSTEM_PROMPT}, + {"role": "user", "content": SOLVER_USER_TEMPLATE.format( + problem_statement=problem_statement + )} + ] + return await call_api_with_retry(cli, SOLVER_MODEL, messages) + +async def grade_solution(cli: AsyncOpenAI, problem_statement: str, solution: str, + reference_solution: str, problem_type: str = "proof") -> Tuple[Optional[Dict], str]: + """Have o3 grade by problem type - 
strict for proofs, answer-focused for calculations.""" + if problem_type == "calculation": + system_prompt = CALCULATION_GRADER_SYSTEM_PROMPT + user_template = CALCULATION_GRADER_USER_TEMPLATE + else: # Default to proof (strict grading) + system_prompt = PROOF_GRADER_SYSTEM_PROMPT + user_template = PROOF_GRADER_USER_TEMPLATE + + messages = [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_template.format( + problem_statement=problem_statement, + solution=solution, + reference_solution=reference_solution + )} + ] + return await call_api_with_retry(cli, GRADER_MODEL, messages) + +async def test_single_file(file_path: pathlib.Path, cli: AsyncOpenAI) -> Dict: + """Test the original problem and its kernel variant from a single file.""" + try: + # Load data + data = json.loads(file_path.read_text(encoding='utf-8')) + index = data.get("index", file_path.stem) + + # Check required fields + original_question = to_str(data.get("question", "")).strip() + original_solution = to_str(data.get("solution", "")).strip() + problem_type = data.get("problem_type", "proof") # Defaults to proof, graded strictly + + kv = data.get("variants", {}).get("kernel_variant") + if not kv: + return { + "index": index, + "status": "skipped", + "reason": "no_kernel_variant" + } + + kernel_question = to_str(kv.get("question", "")).strip() + kernel_solution = to_str(kv.get("solution", "")).strip() + + if not all([original_question, original_solution, kernel_question, kernel_solution]): + return { + "index": index, + "status": "skipped", + "reason": "missing_fields" + } + + print(f"🧮 Testing {index} (Type: {problem_type.upper()})") + start_time = time.time() + + result = { + "index": index, + "status": "completed", + "timestamp": time.time(), + "problem_type": problem_type, + "original": {}, + "kernel_variant": {}, + "comparison": {} + } + + # 1. 
Have 4o-mini solve the original problem + print(f" 📝 Solving original problem...") + orig_solve_result, orig_solve_raw = await solve_problem(cli, original_question) + + if not orig_solve_result: + result["original"]["solve_status"] = "failed" + result["status"] = "failed" + return result + + orig_student_solution = to_str(orig_solve_result.get("solution", "")).strip() + orig_final_answer = to_str(orig_solve_result.get("final_answer", "")).strip() + + result["original"]["student_solution"] = orig_student_solution + result["original"]["student_final_answer"] = orig_final_answer + result["original"]["solve_status"] = "success" + + # 2. Have 4o-mini solve the kernel variant + print(f" 📝 Solving kernel variant...") + kv_solve_result, kv_solve_raw = await solve_problem(cli, kernel_question) + + if not kv_solve_result: + result["kernel_variant"]["solve_status"] = "failed" + result["status"] = "failed" + return result + + kv_student_solution = to_str(kv_solve_result.get("solution", "")).strip() + kv_final_answer = to_str(kv_solve_result.get("final_answer", "")).strip() + + result["kernel_variant"]["student_solution"] = kv_student_solution + result["kernel_variant"]["student_final_answer"] = kv_final_answer + result["kernel_variant"]["solve_status"] = "success" + + # 3. 
Have o3 grade the original solution according to problem type + grading_style = "STRICT" if problem_type == "proof" else "LENIENT" + print(f" 🔍 Grading original solution ({grading_style})...") + orig_grade_result, orig_grade_raw = await grade_solution( + cli, original_question, orig_student_solution, original_solution, problem_type + ) + + if not orig_grade_result: + result["original"]["grade_status"] = "failed" + else: + result["original"]["grade_status"] = "success" + result["original"]["grade"] = orig_grade_result.get("grade", "UNKNOWN") + result["original"]["detailed_feedback"] = orig_grade_result.get("detailed_feedback", "") + result["original"]["major_issues"] = orig_grade_result.get("major_issues", "") + result["original"]["final_answer_correct"] = orig_grade_result.get("final_answer_correct", False) + result["original"]["reasoning_rigor_score"] = orig_grade_result.get("reasoning_rigor_score", 0) + result["original"]["overall_assessment"] = orig_grade_result.get("overall_assessment", "") + + # 4. Have o3 grade the kernel variant solution according to problem type + print(f" 🔍 Grading kernel variant solution ({grading_style})...") + kv_grade_result, kv_grade_raw = await grade_solution( + cli, kernel_question, kv_student_solution, kernel_solution, problem_type + ) + + if not kv_grade_result: + result["kernel_variant"]["grade_status"] = "failed" + else: + result["kernel_variant"]["grade_status"] = "success" + result["kernel_variant"]["grade"] = kv_grade_result.get("grade", "UNKNOWN") + result["kernel_variant"]["detailed_feedback"] = kv_grade_result.get("detailed_feedback", "") + result["kernel_variant"]["major_issues"] = kv_grade_result.get("major_issues", "") + result["kernel_variant"]["final_answer_correct"] = kv_grade_result.get("final_answer_correct", False) + result["kernel_variant"]["reasoning_rigor_score"] = kv_grade_result.get("reasoning_rigor_score", 0) + result["kernel_variant"]["overall_assessment"] = kv_grade_result.get("overall_assessment", "") + + # 5. 
Comparative analysis + if (result["original"]["grade_status"] == "success" and + result["kernel_variant"]["grade_status"] == "success"): + + orig_correct = result["original"]["grade"] == "CORRECT" + kv_correct = result["kernel_variant"]["grade"] == "CORRECT" + + result["comparison"]["original_correct"] = orig_correct + result["comparison"]["kernel_variant_correct"] = kv_correct + result["comparison"]["both_correct"] = orig_correct and kv_correct + result["comparison"]["both_incorrect"] = not orig_correct and not kv_correct + result["comparison"]["original_harder"] = not orig_correct and kv_correct # original is harder + result["comparison"]["kernel_variant_harder"] = orig_correct and not kv_correct # kernel variant is harder + + orig_rigor = result["original"]["reasoning_rigor_score"] + kv_rigor = result["kernel_variant"]["reasoning_rigor_score"] + result["comparison"]["rigor_difference"] = orig_rigor - kv_rigor # positive = original reasoning more rigorous + + total_time = time.time() - start_time + result["processing_time"] = total_time + + print(f" ✅ Completed {index} in {total_time:.1f}s") + if result["comparison"]: + orig_status = "✅" if result["comparison"]["original_correct"] else "❌" + kv_status = "✅" if result["comparison"]["kernel_variant_correct"] else "❌" + print(f" Original: {orig_status}, Kernel Variant: {kv_status}") + + return result + + except Exception as e: + return { + "index": index if 'index' in locals() else file_path.stem, + "status": "error", + "error": str(e), + "error_type": type(e).__name__, + "timestamp": time.time() + } + +async def save_detailed_results(results: List[Dict], output_file: str): + """Save detailed results.""" + output_path = RESULTS_DIR / f"{output_file}_detailed.json" + try: + output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding='utf-8') + print(f"💾 Detailed results saved to {output_path}") + except Exception as e: + print(f"❌ Failed to save detailed results: {e}") + +def generate_summary_report(results: List[Dict]) -> Dict: + """Generate a summary report.""" + summary = { + "total_files": 
len(results), + "completed": 0, + "failed": 0, + "skipped": 0, + "by_problem_type": { + "proof": {"count": 0, "original_correct": 0, "kv_correct": 0}, + "calculation": {"count": 0, "original_correct": 0, "kv_correct": 0} + }, + "original_stats": {"correct": 0, "incorrect": 0, "total_graded": 0}, + "kernel_variant_stats": {"correct": 0, "incorrect": 0, "total_graded": 0}, + "comparison_stats": { + "both_correct": 0, + "both_incorrect": 0, + "original_harder": 0, + "kernel_variant_harder": 0, + "total_compared": 0 + }, + "rigor_analysis": { + "original_avg_rigor": 0, + "kernel_variant_avg_rigor": 0, + "rigor_difference_avg": 0 + } + } + + orig_rigor_scores = [] + kv_rigor_scores = [] + rigor_differences = [] + + for result in results: + if result["status"] == "completed": + summary["completed"] += 1 + + # Stats by problem type + ptype = result.get("problem_type", "proof") + if ptype in summary["by_problem_type"]: + summary["by_problem_type"][ptype]["count"] += 1 + if result["original"].get("grade") == "CORRECT": + summary["by_problem_type"][ptype]["original_correct"] += 1 + if result["kernel_variant"].get("grade") == "CORRECT": + summary["by_problem_type"][ptype]["kv_correct"] += 1 + + # Original-problem stats + if result["original"].get("grade_status") == "success": + summary["original_stats"]["total_graded"] += 1 + if result["original"]["grade"] == "CORRECT": + summary["original_stats"]["correct"] += 1 + else: + summary["original_stats"]["incorrect"] += 1 + orig_rigor_scores.append(result["original"]["reasoning_rigor_score"]) + + # Kernel-variant stats + if result["kernel_variant"].get("grade_status") == "success": + summary["kernel_variant_stats"]["total_graded"] += 1 + if result["kernel_variant"]["grade"] == "CORRECT": + summary["kernel_variant_stats"]["correct"] += 1 + else: + summary["kernel_variant_stats"]["incorrect"] += 1 + kv_rigor_scores.append(result["kernel_variant"]["reasoning_rigor_score"]) + + # Comparison stats + if result.get("comparison"): + summary["comparison_stats"]["total_compared"] += 1 + 
comp = result["comparison"] + if comp["both_correct"]: + summary["comparison_stats"]["both_correct"] += 1 + elif comp["both_incorrect"]: + summary["comparison_stats"]["both_incorrect"] += 1 + elif comp["original_harder"]: + summary["comparison_stats"]["original_harder"] += 1 + elif comp["kernel_variant_harder"]: + summary["comparison_stats"]["kernel_variant_harder"] += 1 + + rigor_differences.append(comp["rigor_difference"]) + + elif result["status"] == "skipped": + summary["skipped"] += 1 + else: + summary["failed"] += 1 + + # Compute average rigor scores + if orig_rigor_scores: + summary["rigor_analysis"]["original_avg_rigor"] = sum(orig_rigor_scores) / len(orig_rigor_scores) + if kv_rigor_scores: + summary["rigor_analysis"]["kernel_variant_avg_rigor"] = sum(kv_rigor_scores) / len(kv_rigor_scores) + if rigor_differences: + summary["rigor_analysis"]["rigor_difference_avg"] = sum(rigor_differences) / len(rigor_differences) + + # Compute accuracy + if summary["original_stats"]["total_graded"] > 0: + summary["original_stats"]["accuracy"] = summary["original_stats"]["correct"] / summary["original_stats"]["total_graded"] + + if summary["kernel_variant_stats"]["total_graded"] > 0: + summary["kernel_variant_stats"]["accuracy"] = summary["kernel_variant_stats"]["correct"] / summary["kernel_variant_stats"]["total_graded"] + + return summary + +def print_summary_report(summary: Dict): + """Print the summary report.""" + print("\n" + "="*80) + print("📊 ORIGINAL vs KERNEL VARIANT COMPARISON REPORT") + print("="*80) + + print(f"📁 Total files: {summary['total_files']}") + print(f"✅ Completed: {summary['completed']}") + print(f"⏭️ Skipped: {summary['skipped']}") + print(f"❌ Failed: {summary['failed']}") + + print(f"\n📈 ACCURACY COMPARISON:") + orig_acc = summary["original_stats"].get("accuracy", 0) * 100 + kv_acc = summary["kernel_variant_stats"].get("accuracy", 0) * 100 + print(f"Original Problems: {orig_acc:.1f}% ({summary['original_stats']['correct']}/{summary['original_stats']['total_graded']})") + print(f"Kernel Variants: 
{kv_acc:.1f}% ({summary['kernel_variant_stats']['correct']}/{summary['kernel_variant_stats']['total_graded']})") + + if orig_acc > 0 and kv_acc > 0: + diff = orig_acc - kv_acc + if diff > 5: + print(f"📉 Kernel variants are {diff:.1f}% harder (as expected)") + elif diff < -5: + print(f"📈 Original problems are {-diff:.1f}% harder (unexpected)") + else: + print(f"📊 Similar difficulty (difference: {diff:.1f}%)") + + print(f"\n🎯 BY PROBLEM TYPE:") + for ptype, stats in summary["by_problem_type"].items(): + if stats["count"] > 0: + orig_acc_type = (stats["original_correct"] / stats["count"]) * 100 + kv_acc_type = (stats["kv_correct"] / stats["count"]) * 100 + grading_note = " (STRICT grading)" if ptype == "proof" else " (LENIENT grading)" + print(f"{ptype.upper()} Problems{grading_note}:") + print(f" Original: {orig_acc_type:.1f}% ({stats['original_correct']}/{stats['count']})") + print(f" Kernel Variant: {kv_acc_type:.1f}% ({stats['kv_correct']}/{stats['count']})") + if stats["count"] >= 3: # Only show difference if we have enough samples + type_diff = orig_acc_type - kv_acc_type + print(f" Difference: {type_diff:+.1f}%") + + print(f"\n🔍 DETAILED COMPARISON:") + comp = summary["comparison_stats"] + total = comp["total_compared"] + if total > 0: + print(f"Both correct: {comp['both_correct']:3d} ({comp['both_correct']/total*100:.1f}%)") + print(f"Both incorrect: {comp['both_incorrect']:3d} ({comp['both_incorrect']/total*100:.1f}%)") + print(f"Original harder: {comp['original_harder']:3d} ({comp['original_harder']/total*100:.1f}%)") + print(f"Kernel variant harder: {comp['kernel_variant_harder']:3d} ({comp['kernel_variant_harder']/total*100:.1f}%)") + + print(f"\n📏 REASONING RIGOR ANALYSIS:") + rigor = summary["rigor_analysis"] + print(f"Original avg rigor: {rigor['original_avg_rigor']:.2f}/10") + print(f"Kernel variant rigor: {rigor['kernel_variant_avg_rigor']:.2f}/10") + print(f"Difference: {rigor['rigor_difference_avg']:.2f} (positive = original more rigorous)") + + 
print("="*80) + +@click.command() +@click.option("-c", "--concurrency", default=16, show_default=True, + help="Maximum concurrent processing tasks") +@click.option("--max-files", default=50, show_default=True, + help="Maximum number of files to test (for quick testing)") +@click.option("--file-pattern", default="*.json", show_default=True, + help="File pattern to process") +@click.option("--output-prefix", default="comparison_test", show_default=True, + help="Prefix for output files") +@click.option("--debug", is_flag=True, help="Enable debug output") +def main(concurrency: int, max_files: int, file_pattern: str, output_prefix: str, debug: bool): + """Original vs kernel-variant mathematical-ability comparison test.""" + print(f"🧪 Starting Original vs Kernel Variant Comparison Test") + print(f" Solver Model: {SOLVER_MODEL}") + print(f" Grader Model: {GRADER_MODEL}") + print(f" Max files: {max_files}") + print(f" Concurrency: {concurrency}") + + if not os.getenv("OPENAI_API_KEY"): + print("❌ OPENAI_API_KEY environment variable not set!") + return + + # Find files to test + all_files = sorted(SRC_DIR.glob(file_pattern)) + if max_files > 0: + all_files = all_files[:max_files] + + print(f"📁 Testing {len(all_files)} files") + + if not all_files: + print("❌ No files found to test!") + return + + async def run_test(): + cli = AsyncOpenAI() + sem = asyncio.Semaphore(concurrency) + + async def worker(file_path: pathlib.Path): + async with sem: + return await test_single_file(file_path, cli) + + # Execute the tests + results = [] + progress_bar = tqdm.tqdm(total=len(all_files), desc="Testing", unit="file") + + tasks = [worker(f) for f in all_files] + for coro in asyncio.as_completed(tasks): + result = await coro + results.append(result) + progress_bar.update(1) + + progress_bar.close() + return results + + # Run the tests + results = asyncio.run(run_test()) + + # Save detailed results + timestamp = int(time.time()) + output_name = f"{output_prefix}_{timestamp}" + asyncio.run(save_detailed_results(results, output_name)) + + # Generate and display the summary report + summary = 
generate_summary_report(results) + print_summary_report(summary) + + # Save the summary report + summary_path = RESULTS_DIR / f"{output_name}_summary.json" + try: + summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding='utf-8') + print(f"💾 Summary report saved to {summary_path}") + except Exception as e: + print(f"❌ Failed to save summary: {e}") + +if __name__ == "__main__": + main() +
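`parse_json_response` in the script above recovers a JSON object from chatty model replies in stages; a trimmed, self-contained sketch of the first two stages, using the same `JSON_RE` regex as the script:

```python
import json
import re
from typing import Dict, Optional

# Same greedy pattern as in the script: grab from the first "{" to the last "}".
JSON_RE = re.compile(r"\{[\s\S]*\}")

def parse_json_response(raw: str) -> Optional[Dict]:
    """Parse a JSON object out of an LLM reply, tolerating surrounding prose."""
    if not raw:
        return None
    try:
        return json.loads(raw)  # fast path: the reply is pure JSON
    except json.JSONDecodeError:
        pass
    # Fallback: extract the outermost {...} span from a chatty reply.
    match = JSON_RE.search(raw)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None

reply = 'Sure! Here is the result:\n{"grade": "CORRECT", "final_answer_correct": true}'
print(parse_json_response(reply))  # {'grade': 'CORRECT', 'final_answer_correct': True}
```

The greedy `[\s\S]*` deliberately spans newlines, so multi-line JSON bodies survive the fallback; the cost is that a reply containing two separate objects yields one invalid span and falls through to `None`.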
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/health_check.py b/putnam-bench-anon/scripts/health_check.py new file mode 100644 index 0000000..65c7855 --- /dev/null +++ b/putnam-bench-anon/scripts/health_check.py @@ -0,0 +1,376 @@ +#!/usr/bin/env python3 +""" +Health check script for all AI providers. + +This script tests connectivity, API keys, and basic functionality for all +supported AI providers. Useful for troubleshooting and verifying setup. + +Usage: + python health_check.py # Check all providers + python health_check.py --provider openai # Check specific provider + python health_check.py --detailed # Detailed diagnostics +""" + +import asyncio +import json +import sys +import os +from pathlib import Path +import argparse +from typing import Dict, List, Any +from datetime import datetime +import platform + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import create_loader, get_supported_providers, get_default_models + + +class HealthChecker: + """Health checker for AI providers.""" + + def __init__(self, detailed: bool = False): + self.detailed = detailed + self.results = {} + + async def check_system_info(self) -> Dict[str, Any]: + """Check system information.""" + import psutil + + return { + 'python_version': platform.python_version(), + 'platform': platform.platform(), + 'cpu_count': psutil.cpu_count(), + 'memory_total_gb': round(psutil.virtual_memory().total / (1024**3), 2), + 'memory_available_gb': round(psutil.virtual_memory().available / (1024**3), 2), + 'disk_free_gb': round(psutil.disk_usage('.').free / (1024**3), 2), + 'timestamp': datetime.now().isoformat() + } + + async def check_environment_variables(self) -> Dict[str, Any]: + """Check required environment variables.""" + env_vars = { + 'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'), + 'ANTHROPIC_API_KEY': os.getenv('ANTHROPIC_API_KEY'), + 'GOOGLE_API_KEY': os.getenv('GOOGLE_API_KEY'), + } + + return { + var: { + 'set': 
bool(value), + 'length': len(value) if value else 0, + 'preview': value[:8] + '...' if value and len(value) > 8 else value + } + for var, value in env_vars.items() + } + + async def check_dependencies(self) -> Dict[str, Any]: + """Check required Python packages.""" + dependencies = { + 'openai': 'OpenAI API client', + 'anthropic': 'Anthropic API client', + 'google-generativeai': 'Google Gemini API client', + 'transformers': 'HuggingFace transformers', + 'torch': 'PyTorch for local models', + 'vllm': 'VLLM for local serving', + 'psutil': 'System monitoring' + } + + results = {} + for package, description in dependencies.items(): + try: + if package == 'google-generativeai': + import google.generativeai + version = getattr(google.generativeai, '__version__', 'unknown') + else: + module = __import__(package) + version = getattr(module, '__version__', 'unknown') + + results[package] = { + 'installed': True, + 'version': version, + 'description': description + } + except ImportError: + results[package] = { + 'installed': False, + 'version': None, + 'description': description + } + + return results + + async def check_provider(self, provider: str) -> Dict[str, Any]: + """Check a specific AI provider.""" + print(f"🔍 Checking {provider}...") + + result = { + 'provider': provider, + 'available': False, + 'health_check_passed': False, + 'error': None, + 'response_time': None, + 'models': {}, + 'cost_estimation': None + } + + try: + # Get default models + default_models = get_default_models(provider) + result['models']['defaults'] = default_models + + # Provider-specific configuration + loader_kwargs = {} + if provider == 'vllm': + loader_kwargs['base_url'] = 'http://localhost:8000/v1' + elif provider == 'huggingface': + loader_kwargs['device'] = 'cpu' # Use CPU for testing + # Use smaller models for testing + loader_kwargs['solver_model'] = 'microsoft/DialoGPT-small' + loader_kwargs['grader_model'] = 'microsoft/DialoGPT-small' + + # Create loader + start_time = 
asyncio.get_event_loop().time() + loader = create_loader(provider, **loader_kwargs) + creation_time = asyncio.get_event_loop().time() - start_time + + result['available'] = True + result['creation_time'] = creation_time + + # Get model info + model_info = loader.get_model_info() + result['models']['configured'] = model_info + + # Health check + health_start = asyncio.get_event_loop().time() + health_passed = await asyncio.wait_for(loader.health_check(), timeout=60) + health_time = asyncio.get_event_loop().time() - health_start + + result['health_check_passed'] = health_passed + result['response_time'] = health_time + + if health_passed: + # Cost estimation + try: + cost_info = await loader.estimate_cost(10) + result['cost_estimation'] = cost_info + except Exception as e: + result['cost_estimation_error'] = str(e) + + # Try to list models if available + if hasattr(loader, 'list_models'): + try: + available_models = await loader.list_models() + result['models']['available'] = available_models[:10] # Limit output + except Exception as e: + result['models']['list_error'] = str(e) + + except asyncio.TimeoutError: + result['error'] = 'Health check timed out' + except Exception as e: + result['error'] = str(e) + + return result + + async def check_all_providers(self, specific_provider: str = None) -> Dict[str, Any]: + """Check all providers or a specific one.""" + providers = [specific_provider] if specific_provider else get_supported_providers() + + print("🏥 AI Provider Health Check") + print("=" * 50) + + # System information + if self.detailed: + print("📊 System Information:") + system_info = await self.check_system_info() + for key, value in system_info.items(): + print(f" {key}: {value}") + print() + + # Environment variables + print("🔧 Environment Variables:") + env_info = await self.check_environment_variables() + for var, info in env_info.items(): + status = "✅" if info['set'] else "❌" + print(f" {status} {var}: {'Set' if info['set'] else 'Not set'}") + print() + 
+ # Dependencies + print("📦 Dependencies:") + dep_info = await self.check_dependencies() + for package, info in dep_info.items(): + status = "✅" if info['installed'] else "❌" + version = f" (v{info['version']})" if info['installed'] and info['version'] != 'unknown' else "" + print(f" {status} {package}{version}") + print() + + # Provider checks + print("🤖 Provider Health Checks:") + provider_results = {} + + for provider in providers: + provider_result = await self.check_provider(provider) + provider_results[provider] = provider_result + + # Print summary + if provider_result['available']: + if provider_result['health_check_passed']: + status = "✅" + details = f"({provider_result['response_time']:.2f}s)" + else: + status = "⚠️" + details = "(Health check failed)" + else: + status = "❌" + details = f"({provider_result['error']})" + + print(f" {status} {provider.upper()}: {details}") + + print() + + # Summary + total_providers = len(providers) + healthy_providers = sum(1 for r in provider_results.values() + if r['available'] and r['health_check_passed']) + + print("📋 Summary:") + print(f" Total providers checked: {total_providers}") + print(f" Healthy providers: {healthy_providers}") + print(f" Success rate: {healthy_providers/total_providers*100:.1f}%") + + # Detailed results + final_results = { + 'timestamp': datetime.now().isoformat(), + 'summary': { + 'total_providers': total_providers, + 'healthy_providers': healthy_providers, + 'success_rate': healthy_providers/total_providers*100 + }, + 'environment': env_info, + 'dependencies': dep_info, + 'providers': provider_results + } + + if self.detailed: + final_results['system'] = system_info + + return final_results + + async def run_diagnostics(self, provider: str) -> Dict[str, Any]: + """Run detailed diagnostics for a specific provider.""" + print(f"🔧 Running detailed diagnostics for {provider}...") + + result = await self.check_provider(provider) + + # Additional detailed checks + if result['available'] and 
result['health_check_passed']: + print(f"✅ {provider} is healthy!") + + # Test with a simple problem + print("🧪 Testing with a simple math problem...") + try: + loader_kwargs = {} + if provider == 'vllm': + loader_kwargs['base_url'] = 'http://localhost:8000/v1' + elif provider == 'huggingface': + loader_kwargs['device'] = 'cpu' + loader_kwargs['solver_model'] = 'microsoft/DialoGPT-small' + loader_kwargs['grader_model'] = 'microsoft/DialoGPT-small' + + loader = create_loader(provider, **loader_kwargs) + + # Simple test problem + test_problem = { + 'original': { + 'problem_statement': 'What is 2 + 2?', + 'solution': 'The answer is 4.', + 'problem_type': 'calculation' + } + } + + start_time = asyncio.get_event_loop().time() + test_result = await asyncio.wait_for( + loader.test_single_problem(test_problem, variant_type='original'), + timeout=120 + ) + test_time = asyncio.get_event_loop().time() - start_time + + result['test_problem'] = { + 'success': True, + 'time': test_time, + 'grade': 10 if test_result.get('correct', False) else 0, + 'solution_length': len(test_result.get('solve', {}).get('solution', '')) + } + print(f" ✅ Test completed in {test_time:.2f}s") + print(f" 📊 Grade: {10 if test_result.get('correct', False) else 0} ({'CORRECT' if test_result.get('correct', False) else 'INCORRECT'})") + + except asyncio.TimeoutError: + result['test_problem'] = {'success': False, 'error': 'Test timed out'} + print(" ⚠️ Test problem timed out") + except Exception as e: + result['test_problem'] = {'success': False, 'error': str(e)} + print(f" ❌ Test problem failed: {str(e)}") + + return result + + +async def main(): + """Main function.""" + parser = argparse.ArgumentParser(description="Health check for AI providers") + parser.add_argument("--provider", choices=get_supported_providers(), + help="Check specific provider only") + parser.add_argument("--detailed", action="store_true", + help="Show detailed system information") + parser.add_argument("--diagnostics", 
action="store_true",
+                        help="Run detailed diagnostics (requires --provider)")
+    parser.add_argument("--output", type=Path,
+                        help="Save results to JSON file")
+    parser.add_argument("--quiet", action="store_true",
+                        help="Suppress output, save to file only")
+
+    args = parser.parse_args()
+
+    if args.diagnostics and not args.provider:
+        print("❌ Error: --diagnostics requires --provider")
+        return 1
+
+    # Redirect output if quiet
+    if args.quiet:
+        import io
+        sys.stdout = io.StringIO()
+
+    checker = HealthChecker(detailed=args.detailed)
+
+    try:
+        if args.diagnostics:
+            results = await checker.run_diagnostics(args.provider)
+        else:
+            results = await checker.check_all_providers(args.provider)
+
+        # Save to file if requested
+        if args.output:
+            args.output.parent.mkdir(parents=True, exist_ok=True)
+            with open(args.output, 'w', encoding='utf-8') as f:
+                json.dump(results, f, indent=2, ensure_ascii=False)
+
+            if not args.quiet:
+                print(f"\n💾 Results saved to {args.output}")
+
+        # Print JSON if quiet mode
+        if args.quiet:
+            sys.stdout = sys.__stdout__
+            print(json.dumps(results, indent=2))
+
+        return 0
+
+    except KeyboardInterrupt:
+        if args.quiet:
+            sys.stdout = sys.__stdout__  # restore stdout so the message is visible
+        print("\n⏸️ Health check interrupted by user")
+        return 1
+    except Exception as e:
+        if args.quiet:
+            sys.stdout = sys.__stdout__  # restore stdout so the error is visible
+        print(f"\n❌ Health check failed: {str(e)}")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(asyncio.run(main()))
\ No newline at end of file diff --git a/putnam-bench-anon/scripts/regrade.py b/putnam-bench-anon/scripts/regrade.py new file mode 100644 index 0000000..ffc177e --- /dev/null +++ b/putnam-bench-anon/scripts/regrade.py @@ -0,0 +1,284 @@ +#!/usr/bin/env python3 +""" +Re-grade an existing results JSON file using a (possibly different) grader model. + +The script loads a results file produced by `batch_evaluate.py` (or a compatible +JSON list) and re-grades every problem using the specified grader. No solving +is performed – instead we reuse the previously generated solutions stored in +`solve.solution`. + +Example usage +------------- +python regrade.py \ + --results-file results/o3/o3_original.json \ + --dataset-dir dataset/ \ + --provider openai \ + --grader-model o3 \ + --max-concurrent 5 \ + --output results/regraded_o3_original.json + +""" + +import argparse +import asyncio +import json +import sys +import time +from pathlib import Path +from typing import Any, Dict, List +from datetime import datetime +import logging + +# Determine directories +SCRIPT_DIR = Path(__file__).resolve().parent +PROJECT_ROOT = SCRIPT_DIR.parent # one level up + +# Add both the script dir and project root to PYTHONPATH to locate 'loader' +sys.path.append(str(SCRIPT_DIR)) +sys.path.append(str(PROJECT_ROOT)) + +from loader import create_loader # type: ignore + +try: + from tqdm import tqdm # type: ignore + HAS_TQDM = True +except ImportError: # pragma: no cover + HAS_TQDM = False + + class tqdm: # type: ignore + """Minimal fallback if tqdm is not available.""" + + def __init__(self, total=None, desc=None, **kwargs): + self.total = total + self.n = 0 + self.desc = desc or "" + print(f"{self.desc}: starting …") + + def update(self, n=1): + self.n += n + if self.total: + pct = self.n / self.total * 100 + print(f"{self.desc}: {self.n}/{self.total} ({pct:.1f}%)", end="\r") + + def set_postfix(self, _): + pass + + def close(self): + print() # newline + + 
+############################################################################### +# Helper functions +############################################################################### + + +def load_dataset(dataset_dir: Path) -> Dict[str, Dict[str, Any]]: + """Read every JSON file in *dataset_dir* and return a mapping index → data.""" + dataset: Dict[str, Dict[str, Any]] = {} + for json_file in dataset_dir.glob("*.json"): + try: + with open(json_file, "r", encoding="utf-8") as fh: + data = json.load(fh) + idx = data.get("index") + if idx: + dataset[idx] = data + except Exception as exc: # pragma: no cover – best-effort ingest + logging.warning("Failed to load %s: %s", json_file, exc) + return dataset + + +async def regrade_problem(loader, # type: ignore[valid-type] + problem_record: Dict[str, Any], + dataset_entry: Dict[str, Any], + variant_type: str) -> Dict[str, Any]: + """Re-grade one problem and return a new result dict.""" + + idx = problem_record.get("index", "unknown") + problem_type = dataset_entry.get("problem_type", "proof") + + # Extract question & reference solution according to variant + if variant_type == "original": + question = str(dataset_entry.get("question", "")).strip() + reference_solution = str(dataset_entry.get("solution", "")).strip() + else: + variant = dataset_entry.get("variants", {}).get(variant_type, {}) + question = str(variant.get("question", "")).strip() + reference_solution = str(variant.get("solution", "")).strip() + + if not question or not reference_solution: + return { + "index": idx, + "status": "skipped", + "reason": "missing_fields", + } + + # Previously generated solution + student_solution = str(problem_record.get("solve", {}).get("solution", "")).strip() + final_answer = str(problem_record.get("solve", {}).get("final_answer", "")).strip() + + # Grade the solution (temperature hard-coded inside create_loader for o-series) + grade_result, _raw = await loader.grade_solution( + question, + student_solution, + reference_solution, 
+ problem_type, + ) + + # Build merged record retaining original fields + new grade + new_record = { + "index": idx, + "variant_type": variant_type, + "problem_type": problem_type, + "solve": { + "solution": student_solution, + "final_answer": final_answer, + }, + "grade": grade_result or {"status": "failed"}, + } + + # Convenience shortcut for correctness + new_record["correct"] = new_record["grade"].get("grade") == "CORRECT" + return new_record + + +############################################################################### +# Main orchestration +############################################################################### + + +async def main() -> None: # noqa: C901 – single entry-point + parser = argparse.ArgumentParser(description="Re-grade an existing results file") + parser.add_argument("--results-file", required=True, type=Path, help="Path to existing results JSON") + parser.add_argument("--dataset-dir", required=True, type=Path, help="Directory containing dataset JSON files") + parser.add_argument("--provider", default="openai", help="Grader provider (default: openai)") + parser.add_argument("--grader-model", default="o3", help="Grader model name (default: o3)") + parser.add_argument("--max-concurrent", type=int, default=3, help="Max concurrent API calls") + parser.add_argument("--variant-type", default="original", help="Problem variant used in results file") + parser.add_argument("--output", type=Path, help="Where to write re-graded results (JSON)") + parser.add_argument("--quick", action="store_true", help="Quick mode – single retry, shorter timeouts") + parser.add_argument("--debug", action="store_true", help="Verbose JSON-parsing debug") + + args = parser.parse_args() + + # Configure logging early + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s [%(levelname)s] %(message)s", + handlers=[logging.StreamHandler(sys.stdout)], + ) + + if not args.results_file.exists(): + logging.error("Results file %s does not exist", 
args.results_file) + sys.exit(1) + + if not args.dataset_dir.exists(): + logging.error("Dataset directory %s does not exist", args.dataset_dir) + sys.exit(1) + + # Load dataset into memory once + logging.info("Loading dataset from %s", args.dataset_dir) + dataset_map = load_dataset(args.dataset_dir) + logging.info("Loaded %d dataset entries", len(dataset_map)) + + # Load results JSON (support two formats: {'problems':[...]} or simple list) + with open(args.results_file, "r", encoding="utf-8") as fh: + raw_data = json.load(fh) + + if isinstance(raw_data, dict) and "problems" in raw_data: + original_problems: List[Dict[str, Any]] = raw_data["problems"] # type: ignore[assignment] + elif isinstance(raw_data, list): + original_problems = raw_data # type: ignore[assignment] + else: + logging.error("Unsupported results file structure – expected list or dict with key 'problems'.") + sys.exit(1) + + if not original_problems: + logging.warning("No problems found in results file – nothing to re-grade.") + sys.exit(0) + + # Create loader – we only need grader, but solver_model must be provided; reuse grader_model + loader = create_loader( + args.provider, + solver_model=args.grader_model, + grader_model=args.grader_model, + quick=args.quick, + debug=args.debug, + ) + + if not await loader.health_check(): + logging.error("Health check failed for provider %s", args.provider) + sys.exit(1) + + # Estimate costs (rough – assumes avg lengths; tweak as needed) + cost_info = await loader.estimate_cost(len(original_problems)) + logging.info("Estimated grading cost: $%.2f", cost_info.get("total_cost", 0)) + + # Concurrency control + semaphore = asyncio.Semaphore(args.max_concurrent) + + async def wrapper(problem_record): + idx = problem_record.get("index", "unknown") + if idx not in dataset_map: + logging.warning("Dataset entry for index %s not found – skipping", idx) + return {"index": idx, "status": "skipped", "reason": "dataset_missing"} + async with semaphore: + return await 
regrade_problem( + loader, + problem_record, + dataset_map[idx], + args.variant_type, + ) + + # Progress bar setup + pbar = tqdm(total=len(original_problems), desc="Re-grading") + results: List[Dict[str, Any]] = [] + + async def gather_tasks(): + for coro in asyncio.as_completed([wrapper(rec) for rec in original_problems]): + res = await coro + results.append(res) + pbar.update(1) + await gather_tasks() + pbar.close() + + # Build summary + completed = [r for r in results if r.get("grade", {}).get("status") == "success"] + grades = [r["grade"].get("grade") for r in completed] + numeric = [5.0 if g == "CORRECT" else 2.5 for g in grades] + + summary = { + "total_problems": len(results), + "completed": len(completed), + "correct": sum(1 for g in grades if g == "CORRECT"), + "incorrect": sum(1 for g in grades if g == "INCORRECT"), + "average_grade": sum(numeric) / len(numeric) if numeric else 0.0, + "provider": args.provider, + "grader_model": args.grader_model, + "variant_type": args.variant_type, + "estimated_cost": cost_info, + "timestamp": datetime.now().isoformat(), + } + + output_payload = { + "summary": summary, + "problems": results, + } + + # Determine output path + if args.output: + out_path = args.output + else: + stem = args.results_file.stem + f"_regraded_{args.grader_model}" + out_path = args.results_file.with_name(stem + args.results_file.suffix) + + with open(out_path, "w", encoding="utf-8") as fh: + json.dump(output_payload, fh, indent=2, ensure_ascii=False) + logging.info("Saved re-graded results to %s", out_path) + + # Clean up HTTP client if applicable + if hasattr(loader, "__aexit__"): + await loader.__aexit__(None, None, None) + + +if __name__ == "__main__": + asyncio.run(main())
\ No newline at end of file diff --git a/putnam-bench-anon/setup_config.py b/putnam-bench-anon/setup_config.py new file mode 100644 index 0000000..a59a0e0 --- /dev/null +++ b/putnam-bench-anon/setup_config.py @@ -0,0 +1,440 @@ +#!/usr/bin/env python3 +""" +Configuration setup script for Putnam Problem Solver. + +This script helps users set up API keys, configure providers, +and verify their environment for mathematical problem solving. + +Usage: + python setup_config.py # Interactive setup + python setup_config.py --check # Check current configuration + python setup_config.py --provider openai # Setup specific provider +""" + +import asyncio +import json +import sys +import os +from pathlib import Path +import argparse +from typing import Dict, Any, Optional +import getpass +import subprocess + +# Add the loader module to the path +sys.path.append(str(Path(__file__).parent)) + +from loader import get_supported_providers, get_default_models + + +class ConfigManager: + """Configuration manager for Putnam Problem Solver.""" + + def __init__(self): + self.config_file = Path.home() / ".putnam_config.json" + self.env_file = Path.home() / ".putnam_env" + + def print_banner(self): + """Print setup banner.""" + print("🛠️ Putnam Problem Solver Configuration Setup") + print("=" * 55) + print("This script will help you configure API keys and settings.") + print() + + def load_config(self) -> Dict[str, Any]: + """Load existing configuration.""" + if self.config_file.exists(): + try: + with open(self.config_file, 'r', encoding='utf-8') as f: + return json.load(f) + except Exception: + pass + return {} + + def save_config(self, config: Dict[str, Any]): + """Save configuration to file.""" + with open(self.config_file, 'w', encoding='utf-8') as f: + json.dump(config, f, indent=2) + print(f"✅ Configuration saved to {self.config_file}") + + def update_env_file(self, env_vars: Dict[str, str]): + """Update environment file.""" + lines = [] + + # Read existing lines + if 
self.env_file.exists():
+            with open(self.env_file, 'r') as f:
+                lines = [line.strip() for line in f if line.strip() and not line.startswith('#')]
+
+        # Remove existing vars that we're updating (lines are stored as "export VAR=value")
+        lines = [line for line in lines if not any(line.startswith(f"export {var}=") or line.startswith(f"{var}=") for var in env_vars)]
+
+        # Add new vars (quoted, in case a value contains shell-special characters)
+        for var, value in env_vars.items():
+            if value:
+                lines.append(f'export {var}="{value}"')
+
+        # Write back
+        with open(self.env_file, 'w') as f:
+            f.write("#!/bin/bash\n")
+            f.write("# Putnam Problem Solver Environment Variables\n")
+            f.write("# Source this file: source ~/.putnam_env\n\n")
+            for line in lines:
+                f.write(line + "\n")
+
+        print(f"✅ Environment file updated: {self.env_file}")
+        print(f"💡 Add to your shell profile: source {self.env_file}")
+
+    def check_dependencies(self) -> Dict[str, bool]:
+        """Check required dependencies."""
+        dependencies = {
+            'python': True,  # We're running Python
+            'pip': self._command_exists('pip'),
+            'openai': self._package_installed('openai'),
+            'anthropic': self._package_installed('anthropic'),
+            'google-generativeai': self._package_installed('google-generativeai'),
+            'transformers': self._package_installed('transformers'),
+            'torch': self._package_installed('torch'),
+            'vllm': self._package_installed('vllm'),
+            'psutil': self._package_installed('psutil')
+        }
+        return dependencies
+
+    def _command_exists(self, command: str) -> bool:
+        """Check if a command exists."""
+        try:
+            subprocess.run([command, '--version'],
+                           capture_output=True, check=True)
+            return True
+        except (subprocess.CalledProcessError, FileNotFoundError):
+            return False
+
+    def _package_installed(self, package: str) -> bool:
+        """Check if a Python package is installed."""
+        try:
+            if package == 'google-generativeai':
+                import google.generativeai
+            else:
+                __import__(package)
+            return True
+        except ImportError:
+            return False
+
+    def install_dependencies(self, packages: list):
+        """Install missing dependencies."""
+        if not packages:
+            print("✅ All
dependencies are installed!") + return + + print(f"📦 Installing missing packages: {', '.join(packages)}") + + # Create requirements for missing packages + package_map = { + 'openai': 'openai', + 'anthropic': 'anthropic', + 'google-generativeai': 'google-generativeai', + 'transformers': 'transformers', + 'torch': 'torch', + 'vllm': 'vllm', + 'psutil': 'psutil' + } + + to_install = [package_map[pkg] for pkg in packages if pkg in package_map] + + if to_install: + try: + subprocess.run([sys.executable, '-m', 'pip', 'install'] + to_install, + check=True) + print("✅ Dependencies installed successfully!") + except subprocess.CalledProcessError as e: + print(f"❌ Failed to install dependencies: {e}") + + def setup_provider(self, provider: str, config: Dict[str, Any]) -> Dict[str, Any]: + """Setup a specific provider.""" + print(f"\n🔧 Setting up {provider.upper()}") + print("-" * 30) + + provider_config = config.get('providers', {}).get(provider, {}) + + if provider == 'openai': + api_key = self._get_api_key( + "OpenAI API Key", + provider_config.get('api_key'), + "Get your key from: https://platform.openai.com/api-keys" + ) + if api_key: + provider_config['api_key'] = api_key + os.environ['OPENAI_API_KEY'] = api_key + + elif provider == 'anthropic': + api_key = self._get_api_key( + "Anthropic API Key", + provider_config.get('api_key'), + "Get your key from: https://console.anthropic.com/" + ) + if api_key: + provider_config['api_key'] = api_key + os.environ['ANTHROPIC_API_KEY'] = api_key + + elif provider == 'gemini': + api_key = self._get_api_key( + "Google API Key", + provider_config.get('api_key'), + "Get your key from: https://makersuite.google.com/app/apikey" + ) + if api_key: + provider_config['api_key'] = api_key + os.environ['GOOGLE_API_KEY'] = api_key + + elif provider == 'kimi': + api_key = self._get_api_key( + "Kimi/Moonshot API Key", + provider_config.get('api_key'), + "Get your key from: https://platform.moonshot.ai/" + ) + if api_key: + 
provider_config['api_key'] = api_key
+                os.environ['MOONSHOT_API_KEY'] = api_key
+
+        elif provider == 'vllm':
+            current_url = provider_config.get('base_url', 'http://localhost:8000/v1')
+            print(f"Current VLLM server URL: {current_url}")
+            new_url = input("Enter VLLM server URL (press Enter to keep current): ").strip()
+            if new_url:
+                provider_config['base_url'] = new_url
+            else:
+                provider_config['base_url'] = current_url
+
+            print("💡 Make sure your VLLM server is running:")
+            print("   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000")
+
+        elif provider == 'huggingface':
+            print("HuggingFace runs locally - no API key needed.")
+            device = provider_config.get('device', 'auto')
+            print(f"Current device setting: {device}")
+            new_device = input("Device (auto/cuda/cpu) [press Enter to keep current]: ").strip()
+            if new_device in ['auto', 'cuda', 'cpu']:
+                provider_config['device'] = new_device
+
+            print("💡 HuggingFace will download models on first use.")
+
+        # Update config
+        if 'providers' not in config:
+            config['providers'] = {}
+        config['providers'][provider] = provider_config
+
+        return config
+
+    def _get_api_key(self, prompt: str, current_key: Optional[str], help_text: str) -> Optional[str]:
+        """Get API key from user."""
+        if current_key:
+            masked_key = current_key[:8] + "..." if len(current_key) > 8 else "***"
+            print(f"Current key: {masked_key}")
+
+            if input("Update API key? (y/n): ").lower().startswith('y'):
+                print(help_text)
+                return getpass.getpass(f"Enter {prompt}: ").strip()
+            else:
+                return current_key
+        else:
+            print(f"No {prompt} found.")
+            print(help_text)
+            if input("Enter API key now? 
(y/n): ").lower().startswith('y'): + return getpass.getpass(f"Enter {prompt}: ").strip() + + return None + + async def test_provider(self, provider: str) -> bool: + """Test a provider configuration.""" + print(f"🧪 Testing {provider}...") + + try: + from loader import create_loader + + loader_kwargs = {} + if provider == 'vllm': + config = self.load_config() + vllm_config = config.get('providers', {}).get('vllm', {}) + loader_kwargs['base_url'] = vllm_config.get('base_url', 'http://localhost:8000/v1') + elif provider == 'huggingface': + config = self.load_config() + hf_config = config.get('providers', {}).get('huggingface', {}) + loader_kwargs['device'] = hf_config.get('device', 'cpu') + loader_kwargs['solver_model'] = 'microsoft/DialoGPT-small' + loader_kwargs['grader_model'] = 'microsoft/DialoGPT-small' + + loader = create_loader(provider, **loader_kwargs) + + # Simple health check + is_healthy = await loader.health_check() + + if is_healthy: + print(f"✅ {provider} is working correctly!") + return True + else: + print(f"❌ {provider} health check failed") + return False + + except Exception as e: + print(f"❌ {provider} test failed: {str(e)}") + return False + + def check_current_config(self): + """Check and display current configuration.""" + print("📋 Current Configuration Status") + print("=" * 40) + + # Environment variables + print("\n🔧 Environment Variables:") + env_vars = { + 'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'), + 'ANTHROPIC_API_KEY': os.getenv('ANTHROPIC_API_KEY'), + 'GOOGLE_API_KEY': os.getenv('GOOGLE_API_KEY') + } + + for var, value in env_vars.items(): + if value: + masked = value[:8] + "..." 
if len(value) > 8 else "***" + print(f" ✅ {var}: {masked}") + else: + print(f" ❌ {var}: Not set") + + # Dependencies + print("\n📦 Dependencies:") + deps = self.check_dependencies() + for dep, installed in deps.items(): + status = "✅" if installed else "❌" + print(f" {status} {dep}") + + # Config file + print(f"\n📄 Config file: {self.config_file}") + if self.config_file.exists(): + print(" ✅ Exists") + config = self.load_config() + providers = config.get('providers', {}) + if providers: + print(" Configured providers:") + for provider in providers: + print(f" • {provider}") + else: + print(" ❌ Not found") + + # Environment file + print(f"\n🌍 Environment file: {self.env_file}") + if self.env_file.exists(): + print(" ✅ Exists") + print(f" 💡 Source with: source {self.env_file}") + else: + print(" ❌ Not found") + + async def interactive_setup(self): + """Run interactive setup.""" + self.print_banner() + + # Check dependencies first + print("🔍 Checking dependencies...") + deps = self.check_dependencies() + missing_deps = [dep for dep, installed in deps.items() if not installed and dep != 'pip'] + + if missing_deps: + print(f"\n⚠️ Missing dependencies: {', '.join(missing_deps)}") + if input("Install missing dependencies? (y/n): ").lower().startswith('y'): + self.install_dependencies(missing_deps) + + # Load existing config + config = self.load_config() + + # Provider setup + print(f"\n🤖 Available providers: {', '.join(get_supported_providers())}") + + # Ask which providers to configure + if input("\nConfigure all providers? (y/n): ").lower().startswith('y'): + providers_to_setup = get_supported_providers() + else: + providers_to_setup = [] + for provider in get_supported_providers(): + if input(f"Configure {provider}? 
(y/n): ").lower().startswith('y'):
+                    providers_to_setup.append(provider)
+
+        # Setup each provider
+        env_vars = {}
+        for provider in providers_to_setup:
+            config = self.setup_provider(provider, config)
+
+            # Collect env vars
+            provider_config = config.get('providers', {}).get(provider, {})
+            if provider == 'openai' and 'api_key' in provider_config:
+                env_vars['OPENAI_API_KEY'] = provider_config['api_key']
+            elif provider == 'anthropic' and 'api_key' in provider_config:
+                env_vars['ANTHROPIC_API_KEY'] = provider_config['api_key']
+            elif provider == 'gemini' and 'api_key' in provider_config:
+                env_vars['GOOGLE_API_KEY'] = provider_config['api_key']
+            elif provider == 'kimi' and 'api_key' in provider_config:
+                env_vars['MOONSHOT_API_KEY'] = provider_config['api_key']
+
+        # Save configuration
+        self.save_config(config)
+
+        # Update environment file
+        if env_vars:
+            self.update_env_file(env_vars)
+
+        # Test providers
+        if input("\nTest configured providers? (y/n): ").lower().startswith('y'):
+            print("\n🧪 Testing providers...")
+            for provider in providers_to_setup:
+                await self.test_provider(provider)
+
+        print("\n🎉 Setup completed!")
+        print("\n💡 Next steps:")
+        print("   1. Source environment file: source ~/.putnam_env")
+        print("   2. Test a provider: python putnam_cli.py test --provider openai")
+        print("   3. 
Solve a problem: python putnam_cli.py solve dataset/1938-A-1.json")
+
+
+async def main():
+    """Main function."""
+    parser = argparse.ArgumentParser(description="Configure Putnam Problem Solver")
+    parser.add_argument("--check", action="store_true", help="Check current configuration")
+    parser.add_argument("--provider", choices=get_supported_providers(),
+                        help="Setup specific provider only")
+    parser.add_argument("--install-deps", action="store_true", help="Install missing dependencies")
+    parser.add_argument("--test", choices=get_supported_providers(),
+                        help="Test specific provider")
+
+    args = parser.parse_args()
+
+    manager = ConfigManager()
+
+    try:
+        if args.check:
+            manager.check_current_config()
+        elif args.install_deps:
+            deps = manager.check_dependencies()
+            missing = [dep for dep, installed in deps.items() if not installed and dep != 'pip']
+            manager.install_dependencies(missing)
+        elif args.test:
+            await manager.test_provider(args.test)
+        elif args.provider:
+            manager.print_banner()
+            config = manager.load_config()
+            config = manager.setup_provider(args.provider, config)
+            manager.save_config(config)
+
+            # Test the provider
+            if input(f"Test {args.provider}? (y/n): ").lower().startswith('y'):
+                await manager.test_provider(args.provider)
+        else:
+            # Interactive setup
+            await manager.interactive_setup()
+
+        return 0
+
+    except KeyboardInterrupt:
+        print("\n⏸️ Setup interrupted by user")
+        return 1
+    except Exception as e:
+        print(f"\n❌ Setup failed: {str(e)}")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(asyncio.run(main()))
\ No newline at end of file
diff --git a/putnamsup/evaluate_putnam_gap.py b/putnamsup/evaluate_putnam_gap.py
new file mode 100644
index 0000000..5c9f35e
--- /dev/null
+++ b/putnamsup/evaluate_putnam_gap.py
@@ -0,0 +1,74 @@
+import json
+import argparse
+import re
+
+def normalize_answer(text):
+    """Simple normalization for comparison."""
+    if text is None: return ""
+    text = text.strip().lower()
+    # Remove LaTeX delimiters for a rough plain-text check
+    text = re.sub(r'\\[\(\)\[\]]', ' ', text)
+    return text
+
+def simple_evaluate(ground_truth, generated):
+    """
+    A very naive evaluator: if the normalized ground truth is short (likely
+    a bare numeric or symbolic answer), return True when it appears as a
+    substring of the normalized generated output; otherwise return False.
+    """
+    gt_norm = normalize_answer(ground_truth)
+    gen_norm = normalize_answer(generated)
+
+    # If ground truth is very short (likely a number or variable), check if it's in the generated text
+    if len(gt_norm) < 20:
+        return gt_norm in gen_norm
+
+    # For longer proofs, this metric is useless.
+    return False
+
+def main():
+    parser = argparse.ArgumentParser(description="Evaluate PutnamGAP results")
+    parser.add_argument("--results_file", type=str, required=True, help="Path to JSONL results file")
+    args = parser.parse_args()
+
+    total = 0
+    correct_heuristic = 0
+    by_type = {}
+
+    print(f"Evaluating {args.results_file}...")
+
+    with open(args.results_file, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line: continue
+
+            data = json.loads(line)
+            prob_type = data.get("problem_type", "unknown")
+
+            total += 1
+            if prob_type not in by_type:
+                by_type[prob_type] = {"count": 0, "heuristic_match": 0}
+
+            by_type[prob_type]["count"] += 1
+
+            # This is a placeholder evaluation.
+            # Real evaluation for proofs needs an LLM judge. 
+ is_match = simple_evaluate(data["solution"], data["generated_solution"]) + + if is_match: + correct_heuristic += 1 + by_type[prob_type]["heuristic_match"] += 1 + + print(f"Total processed: {total}") + print("-" * 40) + print("Breakdown by Problem Type:") + for p_type, stats in by_type.items(): + acc = (stats["heuristic_match"] / stats["count"]) * 100 if stats["count"] > 0 else 0 + print(f" {p_type}: {stats['count']} items, {stats['heuristic_match']} heuristic matches ({acc:.2f}%)") + print("-" * 40) + print("Note: The heuristic match is very basic (checks if short ground truth is substring of generated output).") + print("For 'proof' problems, this metric is not reliable. Use an LLM-based judge for accurate evaluation.") + +if __name__ == "__main__": + main() + diff --git a/putnamsup/putnam_utils.py b/putnamsup/putnam_utils.py new file mode 100644 index 0000000..7761c49 --- /dev/null +++ b/putnamsup/putnam_utils.py @@ -0,0 +1,95 @@ +import os +import json +from typing import Dict, Any, Generator, Tuple, Optional, List + +# Supported variants as seen in putnamgap_viewer.py +SUPPORTED_VARIANTS = [ + "original", + "descriptive_long", + "descriptive_long_confusing", + "descriptive_long_misleading", + "garbled_string", + "kernel_variant", +] + +def get_original_qa(d: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]: + """Extract original question and solution.""" + question = d.get("question") + solution = d.get("solution", d.get("answer")) + return question, solution + +def get_variant_qa(d: Dict[str, Any], variant_key: str) -> Tuple[Optional[str], Optional[str]]: + """Extract variant question and solution.""" + variants = d.get("variants") + if not isinstance(variants, dict): + return None, None + var = variants.get(variant_key) + if not isinstance(var, dict): + return None, None + question = var.get("question") + solution = var.get("solution", var.get("answer")) + return question, solution + +def load_dataset(data_dir: str, selected_variants: 
Optional[List[str]] = None) -> Generator[Dict[str, Any], None, None]: + """ + Iterates over all JSON files in data_dir and yields problem instances. + Each instance is a dict with keys: file_index, type, variant, question, solution. + + Args: + data_dir: Path to the dataset directory. + selected_variants: List of variants to include. If None, include all. + Supported values are in SUPPORTED_VARIANTS. + """ + if not os.path.isdir(data_dir): + raise ValueError(f"Directory not found: {data_dir}") + + # Validate selected_variants + if selected_variants: + for v in selected_variants: + if v not in SUPPORTED_VARIANTS: + print(f"Warning: Variant '{v}' not recognized. Supported: {SUPPORTED_VARIANTS}") + + # If no filter provided, use all supported + target_variants = selected_variants if selected_variants else SUPPORTED_VARIANTS + + files = [f for f in os.listdir(data_dir) if f.lower().endswith(".json")] + files.sort() + + for f in files: + filepath = os.path.join(data_dir, f) + try: + with open(filepath, "r", encoding="utf-8") as fp: + data = json.load(fp) + except Exception as e: + print(f"Error loading {filepath}: {e}") + continue + + file_index = data.get("index", f) # Use filename as index if 'index' key missing + prob_type = data.get("problem_type", "unknown") + + # 1. Original + if "original" in target_variants: + q, a = get_original_qa(data) + if q and a: + yield { + "file_index": file_index, + "problem_type": prob_type, + "variant": "original", + "question": q, + "solution": a + } + + # 2. 
Variants + for var_key in SUPPORTED_VARIANTS: + if var_key == "original": continue + if var_key not in target_variants: continue + + q, a = get_variant_qa(data, var_key) + if q and a: + yield { + "file_index": file_index, + "problem_type": prob_type, + "variant": var_key, + "question": q, + "solution": a + } diff --git a/putnamsup/putnamgap_viewer.py b/putnamsup/putnamgap_viewer.py new file mode 100644 index 0000000..d3678f1 --- /dev/null +++ b/putnamsup/putnamgap_viewer.py @@ -0,0 +1,277 @@ +#!/usr/bin/env python3 +""" +Streamlit-based PutnamGAP dataset viewer. + +Features: +- Scans preprocess/PutnamGAP for JSON files and allows prev/next navigation +- Select specific file from a dropdown +- Choose which variant to display: original or one of: + descriptive_long, descriptive_long_confusing, descriptive_long_misleading, garbled_string, kernel_variant +- Toggle to show Question, Solution (a.k.a. Answer), or Both +- TeX rendering via Markdown by default, with optional HTML+MathJax fallback +""" +import json +import os +from typing import Any, Dict, List, Optional, Tuple + +import streamlit as st +from streamlit.components.v1 import html as st_html + + +DATA_DIR = os.path.join(os.path.dirname(__file__), "PutnamGAP") +SUPPORTED_VARIANTS = [ + "original", + "descriptive_long", + "descriptive_long_confusing", + "descriptive_long_misleading", + "garbled_string", + "kernel_variant", +] + + +def discover_json_files(data_dir: str) -> List[str]: + if not os.path.isdir(data_dir): + return [] + files = [ + os.path.join(data_dir, f) + for f in os.listdir(data_dir) + if f.lower().endswith(".json") + ] + files.sort() + return files + + +def load_json(filepath: str) -> Dict[str, Any]: + with open(filepath, "r", encoding="utf-8") as f: + return json.load(f) + + +def get_original_qa(d: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]: + # Prefer "question"/"solution"; gracefully fall back to "answer" if present + question: Optional[str] = d.get("question") + solution: 
Optional[str] = d.get("solution", d.get("answer")) + return question, solution + + +def get_variant_qa( + d: Dict[str, Any], variant_key: str +) -> Tuple[Optional[str], Optional[str]]: + variants = d.get("variants") + if not isinstance(variants, dict): + return None, None + var = variants.get(variant_key) + if not isinstance(var, dict): + return None, None + question: Optional[str] = var.get("question") + solution: Optional[str] = var.get("solution", var.get("answer")) + return question, solution + + +def render_markdown_with_math(text: str) -> None: + # Streamlit markdown renders $...$ and $$...$$ math natively (via KaTeX) + st.markdown(text, unsafe_allow_html=True) + + +def render_with_mathjax_html(blocks: List[Tuple[str, str]]) -> None: + """ + Render content with MathJax v3 inside a single HTML component. + blocks: list of (heading, content) tuples + """ + # Build a small HTML page with MathJax v3; render all blocks together. + content_sections = [] + for heading, content in blocks: + section_html = f""" + <section style="margin-bottom: 1.25rem;"> + <h3 style="margin: 0 0 .5rem 0; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial;"> + {heading} + </h3> + <div class="mj-content">{content}</div> + </section> + """ + content_sections.append(section_html) + + page = f""" +<!DOCTYPE html> +<html> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <script> + window.MathJax = {{ + tex: {{ + inlineMath: [['$', '$'], ['\\\\(', '\\\\)']], + displayMath: [['$$', '$$'], ['\\\\[', '\\\\]']] + }}, + svg: {{ fontCache: 'global' }} + }}; + </script> + <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js" async></script> + <style> + html, body {{ + background: #0f0f10; + color: #f5f6f7; + }} + body {{ + padding: 0.5rem 0.25rem; + color: #f5f6f7; + background: #0f0f10; + }} + .mj-content {{ + line-height: 1.6; + white-space: pre-wrap; + word-wrap: break-word; + font-family: ui-sans-serif,
system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; + font-size: 1rem; + color: #f5f6f7; + background: #0f0f10; + padding: 0.25rem 0.25rem; + border-radius: 4px; + }} + code, pre {{ + font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace; + color: #e6e6e6; + }} + svg, .MathJax, .mjx-svg, .mjx-mrow {{ + color: #f5f6f7; + }} + </style> + </head> + <body> + {''.join(content_sections)} + </body> +</html> +""" + st_html(page, height=600, scrolling=True) + + +def main() -> None: + st.set_page_config(page_title="PutnamGAP Viewer", layout="wide") + st.title("PutnamGAP Data Viewer & Proofreader") + st.caption("Browse original problems and their variants; supports TeX rendering and prev/next file switching.") + + files = discover_json_files(DATA_DIR) + if not files: + st.error(f"No JSON files found in directory: {DATA_DIR}") + st.stop() + + # Sidebar controls + with st.sidebar: + st.subheader("File & Display Settings") + + # Single source of truth for navigation: file_index + file_labels = [os.path.basename(p) for p in files] + if "file_index" not in st.session_state: + st.session_state.file_index = 0 + + selected_label = st.selectbox( + "Select problem file", + options=file_labels, + index=st.session_state.file_index, + ) + # Sync file_index if user chose a different label + current_index = file_labels.index(selected_label) + if current_index != st.session_state.file_index: + st.session_state.file_index = current_index + + # Variant selection + variant_human_labels = { + "original": "original (source problem)", + "descriptive_long": "descriptive_long", + "descriptive_long_confusing": "descriptive_long_confusing", + "descriptive_long_misleading": "descriptive_long_misleading", + "garbled_string": "garbled_string", + "kernel_variant": "kernel_variant", + } + variant_choice_label = st.radio( + "Select variant to display", + options=[variant_human_labels[k] for k in SUPPORTED_VARIANTS], + index=0, + ) + # Reverse map to internal key + selected_variant = { + v: k for k, v in variant_human_labels.items() + }[variant_choice_label] + + # Display options + show_mode = st.radio(
"Sections to display", + options=["Question", "Solution", "Both"], + index=0, + horizontal=True, + ) + render_mode = st.radio( + "Rendering mode", + options=["Markdown (default)", "HTML + MathJax"], + index=1, + ) + show_meta = st.checkbox("Show raw JSON snippet", value=False) + + # Prev/Next navigation buttons in header row + left, mid, right = st.columns([1, 6, 1]) + with left: + if st.button("⬅️ Previous", use_container_width=True): + new_index = (st.session_state.file_index - 1) % len(files) + st.session_state.file_index = new_index + st.rerun() + with right: + if st.button("Next ➡️", use_container_width=True): + new_index = (st.session_state.file_index + 1) % len(files) + st.session_state.file_index = new_index + st.rerun() + + current_path = files[st.session_state.file_index] + data = load_json(current_path) + + st.write(f"Current file: `{os.path.basename(current_path)}` ({st.session_state.file_index + 1}/{len(files)})") + st.divider() + + # Resolve question/solution for chosen variant + if selected_variant == "original": + q_text, s_text = get_original_qa(data) + else: + q_text, s_text = get_variant_qa(data, selected_variant) + + # Assemble content blocks to render + blocks: List[Tuple[str, str]] = [] + if show_mode in ("Question", "Both"): + if q_text: + blocks.append(("Question", q_text)) + else: + st.warning("No Question found for this selection.") + if show_mode in ("Solution", "Both"): + if s_text: + blocks.append(("Solution", s_text)) + else: + st.warning("No Solution/Answer found for this selection.") + + if len(blocks) > 0: + if render_mode.startswith("Markdown"): + for heading, content in blocks: + st.subheader(heading) + render_markdown_with_math(content) + st.markdown("---") + else: + render_with_mathjax_html(blocks) + else: + st.info("Nothing to display.") + + if show_meta: + with st.expander("Raw JSON (truncated)", expanded=False): + # Show a trimmed version to avoid overwhelming the UI + preview: Dict[str, Any] = {} + for k in ("index", "type", "tag", "difficulty", "problem_type"): + if k in data: + preview[k] = data[k] + preview["keys"] =
list(data.keys()) + st.json(preview) + + st.caption("Full JSON path:") + st.code(current_path) + + st.caption("Tip: use the sidebar to pick a specific file and variant, or the buttons at the top to quickly step through the JSON files.") + + +if __name__ == "__main__": + main() + + diff --git a/putnamsup/requirements.txt b/putnamsup/requirements.txt new file mode 100644 index 0000000..981cde1 --- /dev/null +++ b/putnamsup/requirements.txt @@ -0,0 +1,7 @@ +torch>=2.0.0 +transformers>=4.37.0 +accelerate>=0.26.0 +tqdm>=4.66.0 +openai>=1.0.0 +streamlit>=1.30.0 + diff --git a/putnamsup/run_putnam_gap.py b/putnamsup/run_putnam_gap.py new file mode 100644 index 0000000..73f0ef6 --- /dev/null +++ b/putnamsup/run_putnam_gap.py @@ -0,0 +1,167 @@ +import os +import argparse +import torch +import time +from transformers import AutoModelForCausalLM, AutoTokenizer +from tqdm import tqdm +from putnam_utils import load_dataset, SUPPORTED_VARIANTS +import json + +def run_inference_batch(model, tokenizer, questions: list, device: str) -> list: + """ + Runs generation for a batch of questions. + """ + prompts = [f"Problem:\n{q}\n\nPlease solve the problem above step by step and provide the final answer.\n\nSolution:\n" for q in questions] + + # Determine target device for inputs + if device == "auto": + target_device = model.device + else: + target_device = device + + input_texts = [] + if tokenizer.chat_template: + for p in prompts: + messages = [{"role": "user", "content": p}] + try: + formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + input_texts.append(formatted) + except Exception: + input_texts.append(p) + else: + input_texts = prompts + + # Tokenize with padding + inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).to(target_device) + + with torch.no_grad(): + output_ids = model.generate( + **inputs, + max_new_tokens=1024, + do_sample=False, + pad_token_id=tokenizer.pad_token_id + ) + + # Decode only new tokens + # output_ids contains input_ids + new_tokens. We need to slice.
+ # The tokenizer pads on the left, so every row's prompt occupies the first + # inputs.input_ids.shape[1] positions and its generated tokens follow them; + # slicing off that prefix leaves exactly the new tokens for each row. + + generated_texts = tokenizer.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True) + return [t.strip() for t in generated_texts] + +def main(): + parser = argparse.ArgumentParser(description="Run inference on PutnamGAP dataset") + parser.add_argument("--data_dir", type=str, default="PutnamGAP", help="Path to PutnamGAP JSON files") + parser.add_argument("--model_name_or_path", type=str, required=True, help="Hugging Face model name or path") + parser.add_argument("--output_file", type=str, default="putnam_gap_results.jsonl", help="Output file path") + parser.add_argument("--limit", type=int, default=None, help="Limit total number of problems to run") + parser.add_argument("--limit_per_variant", type=int, default=None, help="Limit number of problems per variant") + parser.add_argument("--batch_size", type=int, default=1, help="Batch size for inference") + parser.add_argument("--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu", help="Device to run on (use 'auto' for multi-GPU)") + parser.add_argument("--dry_run", action="store_true", help="Only load data and print first few examples, do not load model") + parser.add_argument("--variants", type=str, default=None, help=f"Comma-separated list of variants to include. 
Choices: {','.join(SUPPORTED_VARIANTS)}") + args = parser.parse_args() + + # Parse variants argument + selected_variants = None + if args.variants: + selected_variants = [v.strip() for v in args.variants.split(",")] + print(f"Filtering for variants: {selected_variants}") + + # Diagnostic check for CUDA availability (driver/build mismatch) + if torch.cuda.device_count() > 0 and not torch.cuda.is_available(): + print("\n" + "!"*60) + print(f"WARNING: PyTorch detects {torch.cuda.device_count()} CUDA devices but cannot use them.") + print("torch.cuda.is_available() == False") + print(f"Current PyTorch version: {torch.__version__}") + print("Your driver probably supports an older CUDA version than this PyTorch build.") + print("!"*60 + "\n") + + print(f"Scanning data from {args.data_dir}...") + dataset = list(load_dataset(args.data_dir, selected_variants=selected_variants)) + print(f"Found {len(dataset)} problem variants.") + + if args.limit_per_variant: + from collections import defaultdict + counts = defaultdict(int) + filtered_dataset = [] + for item in dataset: + v = item['variant'] + if counts[v] < args.limit_per_variant: + filtered_dataset.append(item) + counts[v] += 1 + dataset = filtered_dataset + print(f"Filtered to {len(dataset)} examples (max {args.limit_per_variant} per variant).") + + if args.dry_run: + if dataset: + print("\n--- Example 1 ---") + print(f"Index: {dataset[0]['file_index']}") + print(f"Variant: {dataset[0]['variant']}") + print(f"Question: {dataset[0]['question'][:200]}...") + print(f"Solution: {dataset[0]['solution'][:200]}...") + return + + print(f"Loading model: {args.model_name_or_path} on {args.device}") + + try: + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True, padding_side='left') + if tokenizer.pad_token_id is None: + if tokenizer.eos_token_id is not None: + tokenizer.pad_token_id = tokenizer.eos_token_id + else: + tokenizer.pad_token_id = 0 + + # Determine dtype + torch_dtype = torch.float16 + if args.device ==
"cpu": + torch_dtype = torch.float32 + + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + device_map=args.device, + trust_remote_code=True, + torch_dtype=torch_dtype + ) + except Exception as e: + print(f"Failed to load model: {e}") + return + + if args.limit: + dataset = dataset[:args.limit] + print(f"Limiting to first {args.limit} examples.") + + with open(args.output_file, "w", encoding="utf-8") as f_out: + batch_size = args.batch_size + for i in tqdm(range(0, len(dataset), batch_size), desc="Running Inference"): + batch = dataset[i : i + batch_size] + questions = [item["question"] for item in batch] + + try: + generated_answers = run_inference_batch(model, tokenizer, questions, args.device) + except Exception as e: + print(f"Error generating for batch starting at index {i}: {e}") + generated_answers = [f"<ERROR: {str(e)}>" for _ in batch] + + for item, ans in zip(batch, generated_answers): + result_entry = { + "file_index": item["file_index"], + "problem_type": item["problem_type"], + "variant": item["variant"], + "question": item["question"], + "solution": item["solution"], + "generated_solution": ans + } + + f_out.write(json.dumps(result_entry, ensure_ascii=False) + "\n") + f_out.flush() + + print(f"Done. Results saved to {args.output_file}") + +if __name__ == "__main__": + main() diff --git a/putnamsup/run_putnam_gap_openrouter.py b/putnamsup/run_putnam_gap_openrouter.py new file mode 100644 index 0000000..8a23141 --- /dev/null +++ b/putnamsup/run_putnam_gap_openrouter.py @@ -0,0 +1,124 @@ +import os +import json +import argparse +import asyncio +import time +from tqdm.asyncio import tqdm +from putnam_utils import load_dataset, SUPPORTED_VARIANTS + +try: + from openai import AsyncOpenAI +except ImportError: + AsyncOpenAI = None + +async def process_item(sem, client, model_name, item): + """ + Process a single item with semaphore for concurrency control. 
+ """ + async with sem: + question = item["question"] + prompt = f"Problem:\n{question}\n\nPlease solve the problem above step by step and provide the final answer.\n\nSolution:\n" + messages = [{"role": "user", "content": prompt}] + + try: + # Call API asynchronously + completion = await client.chat.completions.create( + model=model_name, + messages=messages, + temperature=0.0, + max_tokens=2048, + extra_headers={ + "HTTP-Referer": "https://github.com/PutnamGAP", + "X-Title": "PutnamGAP Eval", + } + ) + generated_answer = completion.choices[0].message.content + except Exception as e: + generated_answer = f"<API ERROR: {str(e)}>" + + # Construct result entry + result_entry = { + "file_index": item["file_index"], + "problem_type": item["problem_type"], + "variant": item["variant"], + "question": question, + "solution": item["solution"], + "generated_solution": generated_answer, + "model": model_name + } + return result_entry + +async def run_async_inference(args, dataset): + if AsyncOpenAI is None: + print("Error: 'openai' library not found. Please install it via: pip install openai") + return + + if not args.api_key: + print("Error: API key not provided. Use --api_key or set OPENROUTER_API_KEY env var.") + return + + print(f"Initializing AsyncOpenAI client with base_url={args.base_url}") + client = AsyncOpenAI( + base_url=args.base_url, + api_key=args.api_key, + ) + + concurrency = args.concurrency + print(f"Running with concurrency: {concurrency}") + sem = asyncio.Semaphore(concurrency) + + tasks = [] + for item in dataset: + task = process_item(sem, client, args.model_name, item) + tasks.append(task) + + print(f"Starting {len(tasks)} tasks using model: {args.model_name}") + + with open(args.output_file, "w", encoding="utf-8") as f_out: + for future in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Async Inference"): + result = await future + f_out.write(json.dumps(result, ensure_ascii=False) + "\n") + f_out.flush() + + print(f"Done. 
Results saved to {args.output_file}") + +def main(): + parser = argparse.ArgumentParser(description="Run inference on PutnamGAP dataset via OpenRouter (Async)") + parser.add_argument("--data_dir", type=str, default="PutnamGAP", help="Path to PutnamGAP JSON files") + parser.add_argument("--model_name", type=str, required=True, help="OpenRouter model name") + parser.add_argument("--output_file", type=str, default="putnam_gap_openrouter_results.jsonl", help="Output file path") + parser.add_argument("--limit", type=int, default=None, help="Limit number of problems to run (for testing)") + parser.add_argument("--concurrency", type=int, default=10, help="Number of concurrent requests") + parser.add_argument("--api_key", type=str, default=os.getenv("OPENROUTER_API_KEY"), help="OpenRouter API Key") + parser.add_argument("--base_url", type=str, default="https://openrouter.ai/api/v1", help="API Base URL") + parser.add_argument("--dry_run", action="store_true", help="Only load data and print info") + parser.add_argument("--variants", type=str, default=None, help=f"Comma-separated list of variants to include. Choices: {','.join(SUPPORTED_VARIANTS)}") + + args = parser.parse_args() + + # Parse variants argument + selected_variants = None + if args.variants: + selected_variants = [v.strip() for v in args.variants.split(",")] + print(f"Filtering for variants: {selected_variants}") + + print(f"Scanning data from {args.data_dir}...") + dataset = list(load_dataset(args.data_dir, selected_variants=selected_variants)) + print(f"Found {len(dataset)} problem variants.") + + if args.dry_run: + if dataset: + print("\n--- Example 1 ---") + print(f"Index: {dataset[0]['file_index']}") + print(f"Variant: {dataset[0]['variant']}") + print(f"Question: {dataset[0]['question'][:200]}...") + return + + if args.limit: + dataset = dataset[:args.limit] + print(f"Limiting to first {args.limit} examples.") + + asyncio.run(run_async_inference(args, dataset)) + +if __name__ == "__main__": + main()
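Note on run_putnam_gap_openrouter.py: the load-bearing pattern is the semaphore-bounded async fan-out, where every item gets its own task up front but at most `concurrency` of them are inside the API call at once, and results are streamed out in completion order. A minimal, self-contained sketch of that pattern, with `asyncio.sleep(0)` standing in for the real chat-completion call (`run_all` and the stub payload are illustrative, not part of the repo):

```python
import asyncio
from typing import Any, Dict, List

async def process_item(sem: asyncio.Semaphore, item: Dict[str, Any]) -> Dict[str, Any]:
    # The semaphore caps how many coroutines are inside this block at once,
    # which is what bounds the number of in-flight requests.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real API call
        return {"file_index": item["file_index"], "generated_solution": "<stub>"}

async def run_all(dataset: List[Dict[str, Any]], concurrency: int) -> List[Dict[str, Any]]:
    sem = asyncio.Semaphore(concurrency)
    tasks = [asyncio.ensure_future(process_item(sem, item)) for item in dataset]
    results = []
    # as_completed yields futures in finish order, so each entry could be
    # written to the JSONL output (and flushed) as soon as it is ready.
    for future in asyncio.as_completed(tasks):
        results.append(await future)
    return results

if __name__ == "__main__":
    demo = [{"file_index": i} for i in range(5)]
    print(len(asyncio.run(run_all(demo, concurrency=2))))  # 5
```

Because every task is created eagerly, errors are best caught inside `process_item` (as the real script does with its `<API ERROR: ...>` sentinel) so one failed request cannot abort the whole run.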
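Both runners emit the same JSONL schema per line (file_index, problem_type, variant, question, solution, generated_solution), so downstream aggregation is a one-pass tally. A hedged sketch assuming only that schema (`tally_by_variant` is an illustrative helper, not part of the repo):

```python
import json
from collections import Counter

def tally_by_variant(jsonl_lines):
    """Count result entries per variant from lines shaped like the
    result_entry dicts the runners write (one JSON object per line)."""
    counts = Counter()
    for line in jsonl_lines:
        entry = json.loads(line)
        counts[entry["variant"]] += 1
    return dict(counts)

# Tiny fabricated sample in the runners' output shape:
sample = [
    json.dumps({"file_index": "p1", "variant": "original", "generated_solution": "..."}),
    json.dumps({"file_index": "p1", "variant": "kernel_variant", "generated_solution": "..."}),
    json.dumps({"file_index": "p2", "variant": "original", "generated_solution": "..."}),
]
print(tally_by_variant(sample))  # {'original': 2, 'kernel_variant': 1}
```

In practice the lines would come from iterating over the `--output_file` JSONL; grading correctness (rather than counting completions) would additionally need an answer-matching step, which this sketch deliberately leaves out.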
