# RLVR Floating-Point Precision Experiment

This repository implements experiments to study the effects of floating-point precision (FP32 vs bf16) on RLVR (Reinforcement Learning with Verifiable Rewards) training.

## Overview

The experiment aims to verify three key hypotheses based on the RLVR Three-Gate Theory:

1. **On-task performance is insensitive to floating-point noise**: Training task performance should be similar between FP32 and bf16 precision modes.
2. **Off-task performance is sensitive to floating-point noise**: Out-of-distribution task performance should show higher variance in bf16 mode due to numerical noise accumulation.
3. **KL divergence patterns differ by task type**: On-task KL should be constrained by DAPO's implicit leash (Gate I), while off-task KL may drift more in bf16 mode.

## Project Structure

```
rl-floating-noise/
├── config.py                   # Configuration definitions
├── train_rlvr.py               # Training script with DAPO algorithm
├── eval_policy.py              # Evaluation script (J_k, KL_k)
├── utils_math_eval.py          # Math answer verification utilities
├── utils_kl.py                 # KL divergence computation utilities
├── utils_bf16_sparsity.py      # bf16-aware update sparsity analysis
├── run_experiments.py          # Experiment orchestration script
├── analyze_results.py          # Results analysis and hypothesis testing
├── requirements.txt            # Python dependencies
├── configs/
│   └── eval_tasks_config.json  # Evaluation task configurations
├── scripts/
│   ├── prepare_data.py         # Dataset preparation script
│   ├── run_training.sh         # Single training job script
│   ├── run_evaluation.sh       # Single evaluation job script
│   └── run_full_experiment.sh  # Full experiment pipeline
├── data/                       # Training and evaluation datasets
└── results/                    # Experiment outputs
    ├── train_logs/             # Training checkpoints and logs
    ├── eval_metrics/           # Evaluation results
    └── analysis/               # Analysis outputs and plots
```

## Installation

```bash
# Create conda environment
conda create -n rlvr-fp python=3.10 -y
conda activate rlvr-fp

# Install dependencies
pip install -r requirements.txt

# Install VeRL (optional, for full DAPO implementation)
pip install git+https://github.com/volcengine/verl.git
```

## Quick Start

### 1. Prepare Data

Generate sample datasets for development:

```bash
python scripts/prepare_data.py --output_dir ./data
```

For production experiments, download the actual datasets:

- DM (DAPO-Math-17k + MATH)
- AIME24, AIME25, AMC23, MATH-500
- GSM8K, MMLU-STEM, HumanEval

### 2. Run Single Training Job

```bash
# Train with bf16 precision
python train_rlvr.py \
    --precision_mode bf16 \
    --seed 1 \
    --output_dir results/train_logs/bf16_seed1 \
    --train_dataset_path data/dm_train.json

# Train with fp32 precision
python train_rlvr.py \
    --precision_mode fp32 \
    --seed 1 \
    --output_dir results/train_logs/fp32_seed1 \
    --train_dataset_path data/dm_train.json
```

### 3. Run Evaluation

```bash
python eval_policy.py \
    --base_ckpt Qwen/Qwen2.5-Math-7B \
    --ft_ckpt results/train_logs/bf16_seed1/final_model \
    --eval_tasks_config configs/eval_tasks_config.json \
    --output_path results/eval_metrics/bf16_seed1.json \
    --eval_base
```

### 4. Run Full Experiment

```bash
# Run complete experiment pipeline
bash scripts/run_full_experiment.sh

# Or use Python orchestrator
python run_experiments.py --mode full --seeds 1 2 3 4 5
```

### 5. Analyze Results

```bash
python analyze_results.py \
    --results_dir results/eval_metrics \
    --output_dir results/analysis \
    --on_task dm_val \
    --off_task aime24 aime25 amc23 math500 mmlu_stem humaneval
```

## Precision Configurations

### P-high (FP32)

- Master weights stored in FP32
- Deterministic algorithms enabled
- Dropout disabled
- Minimal numerical noise

### P-bf16 (Default RLVR)

- Master weights stored in bf16
- Non-deterministic algorithms
- Dropout enabled
- Higher numerical noise (Gate III effects)

## Key Metrics

### Performance (J_k)

- Pass@1 accuracy for verifiable tasks
- Computed via Monte Carlo sampling

### Performance Delta (ΔJ_k)

```
ΔJ_k = J_k(θ_T) - J_k(θ_0)
```

### KL Divergence

```
KL_k ≈ E_x E_{y~π_θ} [log π_θ(y|x) - log π_0(y|x)]
```

### bf16 Update Sparsity

```
sparsity = 1 - |{i : |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
```
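The snippet below is a minimal sketch of how the last two metrics might be computed; it is not the repository's `utils_kl.py` or `utils_bf16_sparsity.py`, and the function names, the default η, and the synthetic example tensors are illustrative assumptions.

```python
# Illustrative sketch of the metrics above; names and defaults are assumptions,
# not the repository's actual implementation.
import torch


def kl_estimate(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> float:
    """Monte Carlo estimate of KL_k: mean of log pi_theta(y|x) - log pi_0(y|x)
    over per-sequence log-probs of samples y ~ pi_theta."""
    return (logp_theta - logp_ref).mean().item()


def bf16_update_sparsity(w_before: torch.Tensor, w_after: torch.Tensor,
                         eta: float = 1e-3) -> float:
    """Fraction of weights whose relative change is within eta (counted as unchanged)."""
    w0, w1 = w_before.float().flatten(), w_after.float().flatten()
    changed = (w0 - w1).abs() > eta * torch.maximum(w0.abs(), w1.abs())
    return 1.0 - changed.float().mean().item()


# Example: a small hypothetical weight matrix where most tiny updates round away in bf16.
w_base = torch.randn(1024, 1024).to(torch.bfloat16)
w_ft = (w_base.float() + 1e-5 * torch.randn(1024, 1024)).to(torch.bfloat16)
print(f"bf16 update sparsity: {bf16_update_sparsity(w_base, w_ft):.1%}")
```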
## Expected Results

Based on RLVR theory predictions:

| Metric                  | On-task     | Off-task        |
|-------------------------|-------------|-----------------|
| E[ΔJ] difference        | Small (~0)  | Variable        |
| Var[ΔJ] (bf16 vs fp32)  | Similar     | bf16 >> fp32    |
| KL divergence           | Constrained | Higher variance |
| bf16 sparsity           | 36-92%      | -               |

## Configuration

### Training Hyperparameters (RLVR defaults)

- Model: Qwen2.5-Math-7B
- Algorithm: DAPO (clip-only, β=0)
- Batch size: 256
- Learning rate: 1e-6
- Training steps: 300
- Rollouts per prompt: 16

### Evaluation Settings

- Temperature: 0.7
- Top-p: 0.8
- Max generation length: 2048-4096 (task-dependent)

## Citation

If you use this code, please cite the RLVR paper:

```bibtex
@article{rlvr2024,
  title={Reinforcement Learning with Verifiable Rewards},
  author={...},
  year={2024}
}
```

## License

MIT License