# RLVR Floating-Point Precision Experiment
This repository implements experiments to study the effects of floating-point precision (FP32 vs bf16) on RLVR (Reinforcement Learning with Verifiable Rewards) training.
## Overview
The experiment tests three hypotheses derived from the RLVR Three-Gate Theory:
1. **On-task performance is insensitive to floating-point noise**: Training task performance should be similar between FP32 and bf16 precision modes.
2. **Off-task performance is sensitive to floating-point noise**: Out-of-distribution task performance should show higher variance in bf16 mode due to numerical noise accumulation.
3. **KL divergence patterns differ by task type**: On-task KL should be constrained by DAPO's implicit leash (Gate I), while off-task KL may drift more in bf16 mode.
## Project Structure
```
rl-floating-noise/
├── config.py # Configuration definitions
├── train_rlvr.py # Training script with DAPO algorithm
├── eval_policy.py # Evaluation script (J_k, KL_k)
├── utils_math_eval.py # Math answer verification utilities
├── utils_kl.py # KL divergence computation utilities
├── utils_bf16_sparsity.py # bf16-aware update sparsity analysis
├── run_experiments.py # Experiment orchestration script
├── analyze_results.py # Results analysis and hypothesis testing
├── requirements.txt # Python dependencies
├── configs/
│ └── eval_tasks_config.json # Evaluation task configurations
├── scripts/
│ ├── prepare_data.py # Dataset preparation script
│ ├── run_training.sh # Single training job script
│ ├── run_evaluation.sh # Single evaluation job script
│ └── run_full_experiment.sh # Full experiment pipeline
├── data/ # Training and evaluation datasets
└── results/ # Experiment outputs
├── train_logs/ # Training checkpoints and logs
├── eval_metrics/ # Evaluation results
└── analysis/ # Analysis outputs and plots
```
## Installation
```bash
# Create conda environment
conda create -n rlvr-fp python=3.10 -y
conda activate rlvr-fp
# Install dependencies
pip install -r requirements.txt
# Install VeRL (optional, for full DAPO implementation)
pip install git+https://github.com/volcengine/verl.git
```
## Quick Start
### 1. Prepare Data
Generate sample datasets for development:
```bash
python scripts/prepare_data.py --output_dir ./data
```
For production experiments, download the actual datasets:
- DM (DAPO-Math-17k + MATH)
- AIME24, AIME25, AMC23, MATH-500
- GSM8K, MMLU-STEM, HumanEval
### 2. Run Single Training Job
```bash
# Train with bf16 precision
python train_rlvr.py \
--precision_mode bf16 \
--seed 1 \
--output_dir results/train_logs/bf16_seed1 \
--train_dataset_path data/dm_train.json
# Train with fp32 precision
python train_rlvr.py \
--precision_mode fp32 \
--seed 1 \
--output_dir results/train_logs/fp32_seed1 \
--train_dataset_path data/dm_train.json
```
### 3. Run Evaluation
```bash
python eval_policy.py \
--base_ckpt Qwen/Qwen2.5-Math-7B \
--ft_ckpt results/train_logs/bf16_seed1/final_model \
--eval_tasks_config configs/eval_tasks_config.json \
--output_path results/eval_metrics/bf16_seed1.json \
--eval_base
```
### 4. Run Full Experiment
```bash
# Run complete experiment pipeline
bash scripts/run_full_experiment.sh
# Or use Python orchestrator
python run_experiments.py --mode full --seeds 1 2 3 4 5
```
### 5. Analyze Results
```bash
python analyze_results.py \
--results_dir results/eval_metrics \
--output_dir results/analysis \
--on_task dm_val \
--off_task aime24 aime25 amc23 math500 mmlu_stem humaneval
```
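The analysis step aggregates per-seed ΔJ values and compares their spread across precision modes (hypothesis 2). Below is a minimal sketch of that variance comparison; the metrics-file layout and JSON field names are assumptions, and the actual logic lives in `analyze_results.py`.
```python
# Sketch of the variance comparison behind hypothesis 2 (off-task Var[ΔJ]
# larger under bf16 than fp32). File layout and JSON field names are
# assumptions; the real logic lives in analyze_results.py.
import json
from pathlib import Path
import numpy as np
from scipy import stats


def load_delta_j(results_dir: str, task: str, precision: str) -> np.ndarray:
    """Collect per-seed ΔJ for one task/precision pair (assumed schema)."""
    deltas = []
    for path in Path(results_dir).glob(f"{precision}_seed*.json"):
        metrics = json.loads(path.read_text())
        # Assumed fields: metrics[task]["J_ft"] and metrics[task]["J_base"].
        deltas.append(metrics[task]["J_ft"] - metrics[task]["J_base"])
    return np.array(deltas)


def compare_variance(results_dir: str, task: str) -> None:
    bf16 = load_delta_j(results_dir, task, "bf16")
    fp32 = load_delta_j(results_dir, task, "fp32")
    # Levene's test is robust to non-normal ΔJ distributions across seeds.
    stat, p = stats.levene(bf16, fp32)
    print(f"{task}: var(bf16)={bf16.var(ddof=1):.4f} "
          f"var(fp32)={fp32.var(ddof=1):.4f} Levene W={stat:.3f} p={p:.3f}")


if __name__ == "__main__":
    for task in ["dm_val", "aime24", "mmlu_stem"]:
        compare_variance("results/eval_metrics", task)
```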
## Precision Configurations
### P-high (FP32)
- Master weights stored in FP32
- Deterministic algorithms enabled
- Dropout disabled
- Minimal numerical noise
### P-bf16 (Default RLVR)
- Master weights stored in bf16
- Non-deterministic algorithms
- Dropout enabled
- Higher numerical noise (Gate III effects)
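A minimal sketch of how the two modes above could be wired up in PyTorch is shown below; the actual setup lives in `config.py` / `train_rlvr.py`, and the function name here is illustrative.
```python
# Illustrative precision setup; treat this as a sketch, not the repo's code.
import torch
from transformers import AutoModelForCausalLM


def load_policy(precision_mode: str, name: str = "Qwen/Qwen2.5-Math-7B"):
    if precision_mode == "fp32":
        # P-high: FP32 master weights, deterministic kernels, no dropout.
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.benchmark = False
        dtype = torch.float32
    else:
        # P-bf16: bf16 master weights, default (non-deterministic) kernels.
        dtype = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype)

    if precision_mode == "fp32":
        # Zero out dropout so repeated runs with the same seed coincide.
        for module in model.modules():
            if isinstance(module, torch.nn.Dropout):
                module.p = 0.0
    return model
```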
## Key Metrics
### Performance (J_k)
- Pass@1 accuracy for verifiable tasks
- Computed via Monte Carlo sampling
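Concretely, J_k is estimated by sampling several completions per prompt and averaging the verifier's 0/1 outcomes. A minimal sketch, where `generate` and `verify` stand in for the repo's sampling and `utils_math_eval` interfaces (assumed, not the actual API):
```python
# Monte Carlo pass@1 estimate: sample n completions per prompt and average
# the verifier's 0/1 outcomes. `generate` and `verify` are placeholders for
# the repo's sampling and utils_math_eval interfaces.
from typing import Callable, Sequence


def pass_at_1(prompts: Sequence[str],
              answers: Sequence[str],
              generate: Callable[[str, int], list[str]],
              verify: Callable[[str, str], bool],
              n_samples: int = 16) -> float:
    """Return J_k: mean correctness over prompts and sampled completions."""
    per_prompt = []
    for prompt, answer in zip(prompts, answers):
        completions = generate(prompt, n_samples)
        correct = sum(verify(completion, answer) for completion in completions)
        per_prompt.append(correct / n_samples)
    return sum(per_prompt) / len(per_prompt)
```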
### Performance Delta (ΔJ_k)
```
ΔJ_k = J_k(θ_T) - J_k(θ_0)
```
### KL Divergence
```
KL_k ≈ E_x E_y~π_θ [log π_θ(y|x) - log π_0(y|x)]
```
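This is a Monte Carlo estimate over sequences sampled from the fine-tuned policy π_θ; per sequence it reduces to the difference of summed token log-probabilities under π_θ and the base policy π_0. A sketch assuming per-token log-probs are already available for both models (the repo's version is in `utils_kl.py`):
```python
# MC estimate of KL_k: sample y ~ π_θ, then average
# log π_θ(y|x) - log π_0(y|x) over sampled sequences. Inputs are per-token
# log-probs for the same sequences under both models (interface assumed).
from typing import Sequence
import torch


def kl_estimate(logp_theta: Sequence[torch.Tensor],
                logp_ref: Sequence[torch.Tensor]) -> float:
    """Each element is a 1-D tensor of token log-probs for one sequence."""
    diffs = [
        (lt.sum() - lr.sum()).item()  # log π_θ(y|x) - log π_0(y|x)
        for lt, lr in zip(logp_theta, logp_ref)
    ]
    return sum(diffs) / len(diffs)
```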
### bf16 Update Sparsity
```
sparsity = 1 - |{i: |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
```
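Intuitively, this measures the fraction of parameters whose update is small enough, relative to the weight magnitude, to be lost to bf16 rounding. A sketch over two state dicts; the threshold default and state-dict interface are assumptions, and the actual implementation is `utils_bf16_sparsity.py`:
```python
# Update sparsity: fraction of parameters whose change is at most
# η · max(|w_i|, |w'_i|). Illustrative version of utils_bf16_sparsity.py;
# the default η below is an assumption (roughly bf16 relative precision).
import torch


def update_sparsity(before: dict[str, torch.Tensor],
                    after: dict[str, torch.Tensor],
                    eta: float = 2 ** -8) -> float:
    changed, total = 0, 0
    for name, w in before.items():
        w_old = w.to(torch.float32)
        w_new = after[name].to(torch.float32)
        threshold = eta * torch.maximum(w_old.abs(), w_new.abs())
        changed += ((w_new - w_old).abs() > threshold).sum().item()
        total += w_old.numel()
    return 1.0 - changed / total
```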
## Expected Results
Based on RLVR theory predictions:
| Metric | On-task | Off-task |
|--------|---------|----------|
| E[ΔJ] difference (bf16 vs fp32) | Small (~0) | Variable |
| Var[ΔJ] (bf16 vs fp32) | Similar | bf16 >> fp32 |
| KL divergence | Constrained | Higher variance |
| bf16 sparsity | 36-92% | - |
## Configuration
### Training Hyperparameters (RLVR defaults)
- Model: Qwen2.5-Math-7B
- Algorithm: DAPO (clip-only, β=0)
- Batch size: 256
- Learning rate: 1e-6
- Training steps: 300
- Rollouts per prompt: 16
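For reference, these defaults map onto a configuration object roughly like the hypothetical dataclass below; the real definitions live in `config.py` and may differ in field names and structure.
```python
# Hypothetical shape of the training configuration; config.py may differ.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    model_name: str = "Qwen/Qwen2.5-Math-7B"
    algorithm: str = "dapo"        # clip-only objective, KL coefficient β = 0
    precision_mode: str = "bf16"   # "bf16" or "fp32"
    batch_size: int = 256
    learning_rate: float = 1e-6
    total_steps: int = 300
    rollouts_per_prompt: int = 16
    seed: int = 1
```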
### Evaluation Settings
- Temperature: 0.7
- Top-p: 0.8
- Max generation length: 2048-4096 (task-dependent)
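As an illustration, these settings translate to generation parameters along the following lines, assuming a Hugging Face `transformers` generation path (`eval_policy.py` may use a different backend):
```python
# Illustrative decoding setup matching the evaluation settings above.
# Assumes a transformers GenerationConfig; the actual backend may differ.
from transformers import GenerationConfig

eval_gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    max_new_tokens=4096,  # task-dependent: 2048-4096 per the settings above
)
```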
## Citation
If you use this code, please cite the RLVR paper:
```bibtex
@article{rlvr2024,
  title={Reinforcement Learning with Verifiable Rewards},
  author={...},
  year={2024}
}
```
## License
MIT License