| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-04 18:59:35 -0600 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-04 18:59:35 -0600 |
| commit | f1c2cc22d46a6976df3555391e667c7e61592fad (patch) | |
| tree | 0b37b52c8ff91042a742d3b3ec54542cb6d6e2f6 /README.md | |
Diffstat (limited to 'README.md')
| -rw-r--r-- | README.md | 196 |
1 file changed, 196 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..f1695f6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,196 @@

# RLVR Floating-Point Precision Experiment

This repository implements experiments to study the effects of floating-point precision (FP32 vs bf16) on RLVR (Reinforcement Learning with Verifiable Rewards) training.

## Overview

The experiment aims to verify three key hypotheses based on the RLVR Three-Gate Theory:

1. **On-task performance is insensitive to floating-point noise**: Training task performance should be similar between FP32 and bf16 precision modes.

2. **Off-task performance is sensitive to floating-point noise**: Out-of-distribution task performance should show higher variance in bf16 mode due to numerical noise accumulation.

3. **KL divergence patterns differ by task type**: On-task KL should be constrained by DAPO's implicit leash (Gate I), while off-task KL may drift more in bf16 mode.

## Project Structure

```
rl-floating-noise/
├── config.py                 # Configuration definitions
├── train_rlvr.py             # Training script with DAPO algorithm
├── eval_policy.py            # Evaluation script (J_k, KL_k)
├── utils_math_eval.py        # Math answer verification utilities
├── utils_kl.py               # KL divergence computation utilities
├── utils_bf16_sparsity.py    # bf16-aware update sparsity analysis
├── run_experiments.py        # Experiment orchestration script
├── analyze_results.py        # Results analysis and hypothesis testing
├── requirements.txt          # Python dependencies
├── configs/
│   └── eval_tasks_config.json  # Evaluation task configurations
├── scripts/
│   ├── prepare_data.py         # Dataset preparation script
│   ├── run_training.sh         # Single training job script
│   ├── run_evaluation.sh       # Single evaluation job script
│   └── run_full_experiment.sh  # Full experiment pipeline
├── data/                     # Training and evaluation datasets
└── results/                  # Experiment outputs
    ├── train_logs/           # Training checkpoints and logs
    ├── eval_metrics/         # Evaluation results
    └── analysis/             # Analysis outputs and plots
```

## Installation

```bash
# Create conda environment
conda create -n rlvr-fp python=3.10 -y
conda activate rlvr-fp

# Install dependencies
pip install -r requirements.txt

# Install VeRL (optional, for full DAPO implementation)
pip install git+https://github.com/volcengine/verl.git
```

## Quick Start

### 1. Prepare Data

Generate sample datasets for development:

```bash
python scripts/prepare_data.py --output_dir ./data
```

For production experiments, download the actual datasets:
- DM (DAPO-Math-17k + MATH)
- AIME24, AIME25, AMC23, MATH-500
- GSM8K, MMLU-STEM, HumanEval

### 2. Run Single Training Job

```bash
# Train with bf16 precision
python train_rlvr.py \
    --precision_mode bf16 \
    --seed 1 \
    --output_dir results/train_logs/bf16_seed1 \
    --train_dataset_path data/dm_train.json

# Train with fp32 precision
python train_rlvr.py \
    --precision_mode fp32 \
    --seed 1 \
    --output_dir results/train_logs/fp32_seed1 \
    --train_dataset_path data/dm_train.json
```

### 3. Run Evaluation

```bash
python eval_policy.py \
    --base_ckpt Qwen/Qwen2.5-Math-7B \
    --ft_ckpt results/train_logs/bf16_seed1/final_model \
    --eval_tasks_config configs/eval_tasks_config.json \
    --output_path results/eval_metrics/bf16_seed1.json \
    --eval_base
```
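The evaluation step writes per-task metrics to the file given by `--output_path`. As a quick sanity check, something like the sketch below can summarize one run; the key names (`J_base`, `J_ft`, `KL`) and the task-keyed layout are assumptions for illustration and may not match the actual schema produced by `eval_policy.py`.

```python
# Minimal sketch: summarize one evaluation output file per task.
# NOTE: the keys "J_base", "J_ft", and "KL" are assumed for illustration;
# check the actual JSON under results/eval_metrics/ for the real field names.
import json

with open("results/eval_metrics/bf16_seed1.json") as f:
    metrics = json.load(f)

for task, entry in metrics.items():
    delta_j = entry["J_ft"] - entry["J_base"]   # ΔJ_k = J_k(θ_T) - J_k(θ_0)
    print(f"{task:12s}  ΔJ = {delta_j:+.3f}   KL ≈ {entry['KL']:.4f}")
```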
### 4. Run Full Experiment

```bash
# Run complete experiment pipeline
bash scripts/run_full_experiment.sh

# Or use Python orchestrator
python run_experiments.py --mode full --seeds 1 2 3 4 5
```

### 5. Analyze Results

```bash
python analyze_results.py \
    --results_dir results/eval_metrics \
    --output_dir results/analysis \
    --on_task dm_val \
    --off_task aime24 aime25 amc23 math500 mmlu_stem humaneval
```

## Precision Configurations

### P-high (FP32)
- Master weights stored in FP32
- Deterministic algorithms enabled
- Dropout disabled
- Minimal numerical noise

### P-bf16 (Default RLVR)
- Master weights stored in bf16
- Non-deterministic algorithms
- Dropout enabled
- Higher numerical noise (Gate III effects)

## Key Metrics

### Performance (J_k)
- Pass@1 accuracy for verifiable tasks
- Computed via Monte Carlo sampling

### Performance Delta (ΔJ_k)
```
ΔJ_k = J_k(θ_T) - J_k(θ_0)
```

### KL Divergence
```
KL_k ≈ E_x E_y~π_θ [log π_θ(y|x) - log π_0(y|x)]
```

### bf16 Update Sparsity
```
sparsity = 1 - |{i: |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
```
(A minimal sketch of this computation is included at the end of this README.)

## Expected Results

Based on RLVR theory predictions:

| Metric | On-task | Off-task |
|--------|---------|----------|
| E[ΔJ] difference | Small (~0) | Variable |
| Var[ΔJ] (bf16 vs fp32) | Similar | bf16 >> fp32 |
| KL divergence | Constrained | Higher variance |
| bf16 sparsity | 36-92% | - |

## Configuration

### Training Hyperparameters (RLVR defaults)
- Model: Qwen2.5-Math-7B
- Algorithm: DAPO (clip-only, β=0)
- Batch size: 256
- Learning rate: 1e-6
- Training steps: 300
- Rollouts per prompt: 16

### Evaluation Settings
- Temperature: 0.7
- Top-p: 0.8
- Max generation length: 2048-4096 (task-dependent)

## Citation

If you use this code, please cite the RLVR paper:

```bibtex
@article{rlvr2024,
  title={Reinforcement Learning with Verifiable Rewards},
  author={...},
  year={2024}
}
```

## License

MIT License
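For the bf16 update-sparsity metric defined under Key Metrics, the sketch below shows one way to compute it for a single parameter tensor, assuming a before/after pair of weights and a relative threshold η. The function name and signature are illustrative and not the actual `utils_bf16_sparsity.py` API.

```python
# Minimal sketch of the update-sparsity metric from the Key Metrics section:
#   sparsity = 1 - |{i : |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
# Function and argument names are illustrative (not the utils_bf16_sparsity.py API).
import torch

def update_sparsity(w_before: torch.Tensor, w_after: torch.Tensor, eta: float = 1e-3) -> float:
    diff = (w_after.float() - w_before.float()).abs()
    scale = torch.maximum(w_before.float().abs(), w_after.float().abs())
    changed = (diff > eta * scale).sum().item()   # coordinates that moved by more than η relative
    return 1.0 - changed / w_before.numel()

# Toy example: a tiny perturbation of a bf16 tensor leaves most coordinates
# unchanged after rounding back to bf16, so the reported sparsity is high.
w0 = torch.randn(10_000).to(torch.bfloat16)
w1 = (w0.float() + 1e-6 * torch.randn(10_000)).to(torch.bfloat16)
print(f"update sparsity: {update_sparsity(w0, w1):.2%}")
```

In practice one would apply this per tensor of the checkpoint state dict and aggregate across layers.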
