# One-shot Entropy Minimization

[![paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2505.20282)
[![Model](https://img.shields.io/badge/Models/Dataset-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/zgao3186/qwen25math7b-one-shot-em/)
[![Notion](https://img.shields.io/badge/Site-000000.svg?style=for-the-badge&logo=notion&logoColor=white)](https://www.notion.so/One-shot-Entropy-Minimization-202606db813b80639773f850f39246a5) 

### Installation

```bash
conda create -n one-shot-em python=3.10 -y
conda activate one-shot-em
pip install -r requirements.txt
```

---

### Colab Quickstart (single-GPU, no DeepSpeed)

In Colab, start with a smaller model to verify the pipeline end-to-end, then scale up if VRAM allows.

```bash
!git clone https://github.com/YurenHao0426/gee.git
%cd /content/gee/Group-Entropy-Equalization
!pip -q install transformers==4.44.2 accelerate==0.33.0 peft==0.12.0 bitsandbytes==0.43.3 datasets==2.21.0 wandb==0.17.7 pyarrow==17.0.0
```

Create a small parquet if you don’t have one:

```python
import os, pandas as pd

os.makedirs("dataset/1shot_rlvr", exist_ok=True)

# Five toy prompts, tiled to the 1,280 rows expected by the training commands below.
df = pd.DataFrame({"problem": [
    "What is 2 + 2?",
    "If x=3, compute x^2 + 2x + 1.",
    "The doctor is a ____.",
    "Factor 12.",
    "What is 7*8?",
]})
df_big = pd.concat([df] * 256, ignore_index=True).iloc[:1280]
df_big.to_parquet("dataset/1shot_rlvr/pi1_r1280.parquet", index=False)
```
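
Optionally, sanity-check the file before training (plain pandas; the expected shape follows from the slice above):

```python
import pandas as pd

check = pd.read_parquet("dataset/1shot_rlvr/pi1_r1280.parquet")
print(check.shape)                 # expected: (1280, 1)
print(check["problem"].iloc[0])    # first toy prompt
```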

Run training (no DeepSpeed, no AMP to avoid Colab GradScaler quirks):

```bash
!python train.py \
  --model_name Qwen2.5-1.5B \
  --model_path Qwen/Qwen2.5-1.5B \
  --train_data dataset/1shot_rlvr/pi1_r1280.parquet \
  --effective_batch 4 --micro_batch_size 1 \
  --temperature 0.5 --learning_rate 2e-5 --sample_temp 0.5 \
  --max_steps 10 --log_steps 1 --save_steps 10 \
  --run_name colab_em10 --wandb_project one-shot-em \
  --no_deepspeed --mixed_precision no
```

Checkpoints are saved under `checkpoints/<model>/<run_name>/`.
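
If a run directory contains weights in the standard Hugging Face format (an assumption; inspect the saved files for your run), you can load it for a quick generation check, e.g.:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "colab_em10" matches the --run_name used above; adjust to whatever your run produced.
ckpt = "checkpoints/Qwen2.5-1.5B/colab_em10"
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")  # tokenizer from the base model
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto", device_map="auto")

inputs = tok("What is 7*8?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```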

---

### Group-wise Entropy Equalization (GEE)

GEE balances sensitive groups with three components (a minimal sketch follows below):
- Group mass parity: push each group's probability mass toward a target proportion pi
- Group entropy equalization: normalize per-group entropy and equalize it across groups
- Optional anchors: keep global token entropy and the mass on the union of sensitive groups close to the baseline model

Default groups file: `groups/gender.json`.
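
The sketch below illustrates the first two terms in plain PyTorch. The function, its signature, and the weighting are illustrative assumptions only and do not reproduce the repository's `train.py` implementation (see the `--gee_*` flags for the actual knobs):

```python
import math
import torch
import torch.nn.functional as F

def gee_sketch(logits, group_ids, target_pi, alpha=1.0, beta=0.3):
    """Illustrative GEE terms: group mass parity + per-group entropy equalization.

    logits:    [batch, vocab] next-token logits at the positions of interest
    group_ids: {group name: 1-D LongTensor of vocab ids belonging to that group}
    target_pi: {group name: target share of the total sensitive mass}
    """
    probs = F.softmax(logits, dim=-1)

    # Average probability mass assigned to each sensitive group.
    mass = {g: probs[:, ids].sum(dim=-1).mean() for g, ids in group_ids.items()}
    union = sum(mass.values()) + 1e-8

    # (1) Group mass parity: each group's share of the sensitive mass -> target pi.
    parity = sum((mass[g] / union - target_pi[g]) ** 2 for g in group_ids)

    # (2) Group entropy equalization: normalized within-group entropy -> common value.
    ent = {}
    for g, ids in group_ids.items():
        p = probs[:, ids]
        p = p / (p.sum(dim=-1, keepdim=True) + 1e-8)   # renormalize within the group
        h = -(p * (p + 1e-8).log()).sum(dim=-1).mean()
        ent[g] = h / math.log(max(len(ids), 2))        # scale roughly to [0, 1]
    mean_ent = sum(ent.values()) / len(ent)
    equalize = sum((ent[g] - mean_ent) ** 2 for g in group_ids)

    return alpha * parity + beta * equalize
```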

Run on Colab (example):

```bash
!python train.py \
  --model_name Qwen2.5-1.5B \
  --model_path Qwen/Qwen2.5-1.5B \
  --train_data dataset/1shot_rlvr/pi1_r1280.parquet \
  --effective_batch 4 --micro_batch_size 1 \
  --temperature 0.5 --learning_rate 2e-5 --sample_temp 0.5 \
  --max_steps 15 --log_steps 1 --save_steps 5 \
  --run_name colab_gee15 --wandb_project one-shot-em \
  --no_deepspeed --mixed_precision no \
  --gee_enable --gee_groups_path groups/gender.json \
  --gee_alpha 1.0 --gee_beta 0.3 --gee_lambda 0.0 --gee_gamma 0.0 --gee_tau 1e-3 --gee_top_m 50
```

You can customize groups and target proportions in the JSON.
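
As an illustration only (the real schema is whatever `groups/gender.json` uses; the field names below are hypothetical), a customized groups file could be generated like this and passed via `--gee_groups_path`:

```python
import json, os

# Hypothetical schema: group name -> surface tokens, plus a target proportion per group.
# Check groups/gender.json in the repo for the actual field names before relying on this.
groups = {
    "groups": {
        "male":   ["he", "him", "his", "man", "men"],
        "female": ["she", "her", "hers", "woman", "women"],
    },
    "target_pi": {"male": 0.5, "female": 0.5},
}
os.makedirs("groups", exist_ok=True)
with open("groups/custom_gender.json", "w") as f:
    json.dump(groups, f, indent=2)
```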

---

### Reproducing One-shot EM Training (SOTA)

```bash
accelerate launch train.py \
  --model_name Qwen2.5-Math-7B \
  --model_path /path/to/Qwen2.5-Math-7B \
  --train_data dataset/1shot_rlvr/pi1_r1280.parquet \
  --effective_batch 64 \
  --micro_batch_size 2 \
  --temperature 0.5 \
  --learning_rate 2e-5 \
  --max_steps 50 \
  --log_steps 1 \
  --save_steps 1 \
  --run_name one_shot \
  --wandb_project one-shot-em
```

---

### Reproducing Multi-shot EM Training

```bash
accelerate launch train.py \
  --model_name Qwen2.5-Math-7B \
  --model_path /path/to/Qwen2.5-Math-7B \
  --train_data dataset/numina/numina_00.parquet \
  --effective_batch 64 \
  --micro_batch_size 2 \
  --temperature 0.5 \
  --learning_rate 2e-5 \
  --max_steps 50 \
  --log_steps 1 \
  --save_steps 1 \
  --run_name multi_shot \
  --wandb_project one-shot-em
```

---

### Evaluation

```bash
cd Qwen2.5-Eval/evaluation
bash sh/eval_all_math.sh
```

---

### Acknowledgements

Our data and evaluation setup reference and build upon the following open-source contributions:

- [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
- [DeepScaler](https://github.com/agentica-project/deepscaler)
- [One-shot RLVR](https://github.com/ypwang61/One-Shot-RLVR/) – for data selection strategies
- [Qwen2.5-Eval](https://github.com/QwenLM/Qwen2.5-Math/) – for evaluation benchmarks

We sincerely thank the authors and maintainers of these projects for their excellent contributions to the research community!


---

### Citation
```bibtex
@misc{gao2025oneshotentropyminimization,
      title={One-shot Entropy Minimization}, 
      author={Zitian Gao and Lynx Chen and Haoming Luo and Joey Zhou and Bryan Dai},
      year={2025},
      eprint={2505.20282},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20282}, 
}
```