# RLVR Floating-Point Precision Experiment

This repository implements experiments to study the effects of floating-point precision (FP32 vs bf16) on RLVR (Reinforcement Learning with Verifiable Rewards) training.

## Overview

The experiment aims to verify three key hypotheses based on the RLVR Three-Gate Theory:

1. **On-task performance is insensitive to floating-point noise**: Training task performance should be similar between FP32 and bf16 precision modes.
2. **Off-task performance is sensitive to floating-point noise**: Out-of-distribution task performance should show higher variance in bf16 mode due to numerical noise accumulation.
3. **KL divergence patterns differ by task type**: On-task KL should be constrained by DAPO's implicit leash (Gate I), while off-task KL may drift more in bf16 mode.

## Project Structure

```
rl-floating-noise/
├── config.py                   # Configuration definitions
├── train_rlvr.py               # Training script with DAPO algorithm
├── eval_policy.py              # Evaluation script (J_k, KL_k)
├── utils_math_eval.py          # Math answer verification utilities
├── utils_kl.py                 # KL divergence computation utilities
├── utils_bf16_sparsity.py      # bf16-aware update sparsity analysis
├── run_experiments.py          # Experiment orchestration script
├── analyze_results.py          # Results analysis and hypothesis testing
├── requirements.txt            # Python dependencies
├── configs/
│   └── eval_tasks_config.json  # Evaluation task configurations
├── scripts/
│   ├── prepare_data.py         # Dataset preparation script
│   ├── run_training.sh         # Single training job script
│   ├── run_evaluation.sh       # Single evaluation job script
│   └── run_full_experiment.sh  # Full experiment pipeline
├── data/                       # Training and evaluation datasets
└── results/                    # Experiment outputs
    ├── train_logs/             # Training checkpoints and logs
    ├── eval_metrics/           # Evaluation results
    └── analysis/               # Analysis outputs and plots
```

## Installation

```bash
# Create conda environment
conda create -n rlvr-fp python=3.10 -y
conda activate rlvr-fp

# Install dependencies
pip install -r requirements.txt

# Install VeRL (optional, for full DAPO implementation)
pip install git+https://github.com/volcengine/verl.git
```

## Quick Start

### 1. Prepare Data

Generate sample datasets for development:

```bash
python scripts/prepare_data.py --output_dir ./data
```

For production experiments, download the actual datasets:

- DM (DAPO-Math-17k + MATH)
- AIME24, AIME25, AMC23, MATH-500
- GSM8K, MMLU-STEM, HumanEval

### 2. Run Single Training Job

```bash
# Train with bf16 precision
python train_rlvr.py \
    --precision_mode bf16 \
    --seed 1 \
    --output_dir results/train_logs/bf16_seed1 \
    --train_dataset_path data/dm_train.json

# Train with fp32 precision
python train_rlvr.py \
    --precision_mode fp32 \
    --seed 1 \
    --output_dir results/train_logs/fp32_seed1 \
    --train_dataset_path data/dm_train.json
```

### 3. Run Evaluation

```bash
python eval_policy.py \
    --base_ckpt Qwen/Qwen2.5-Math-7B \
    --ft_ckpt results/train_logs/bf16_seed1/final_model \
    --eval_tasks_config configs/eval_tasks_config.json \
    --output_path results/eval_metrics/bf16_seed1.json \
    --eval_base
```

### 4. Run Full Experiment

```bash
# Run complete experiment pipeline
bash scripts/run_full_experiment.sh

# Or use Python orchestrator
python run_experiments.py --mode full --seeds 1 2 3 4 5
```

### 5. Analyze Results

```bash
python analyze_results.py \
    --results_dir results/eval_metrics \
    --output_dir results/analysis \
    --on_task dm_val \
    --off_task aime24 aime25 amc23 math500 mmlu_stem humaneval
```

## Precision Configurations

### P-high (FP32)

- Master weights stored in FP32
- Deterministic algorithms enabled
- Dropout disabled
- Minimal numerical noise

### P-bf16 (Default RLVR)

- Master weights stored in bf16
- Non-deterministic algorithms
- Dropout enabled
- Higher numerical noise (Gate III effects)

## Key Metrics

### Performance (J_k)

- Pass@1 accuracy for verifiable tasks
- Computed via Monte Carlo sampling

### Performance Delta (ΔJ_k)

```
ΔJ_k = J_k(θ_T) - J_k(θ_0)
```

### KL Divergence

```
KL_k ≈ E_x E_{y~π_θ} [log π_θ(y|x) - log π_0(y|x)]
```

### bf16 Update Sparsity

```
sparsity = 1 - |{i : |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
```
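The snippet below is a minimal sketch of how the last two metrics might be computed; it is not the repository's `utils_kl.py` or `utils_bf16_sparsity.py`, and the function names, the default η, and the synthetic example tensors are illustrative assumptions.

```python
# Illustrative sketch of the metrics above; names and defaults are assumptions,
# not the repository's actual implementation.
import torch


def kl_estimate(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> float:
    """Monte Carlo estimate of KL_k: mean of log pi_theta(y|x) - log pi_0(y|x)
    over per-sequence log-probs of samples y ~ pi_theta."""
    return (logp_theta - logp_ref).mean().item()


def bf16_update_sparsity(w_before: torch.Tensor, w_after: torch.Tensor,
                         eta: float = 1e-3) -> float:
    """Fraction of weights whose relative change is within eta (counted as unchanged)."""
    w0, w1 = w_before.float().flatten(), w_after.float().flatten()
    changed = (w0 - w1).abs() > eta * torch.maximum(w0.abs(), w1.abs())
    return 1.0 - changed.float().mean().item()


# Example: a small hypothetical weight matrix where most tiny updates round away in bf16.
w_base = torch.randn(1024, 1024).to(torch.bfloat16)
w_ft = (w_base.float() + 1e-5 * torch.randn(1024, 1024)).to(torch.bfloat16)
print(f"bf16 update sparsity: {bf16_update_sparsity(w_base, w_ft):.1%}")
```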
## Expected Results

Based on RLVR theory predictions:

| Metric                  | On-task     | Off-task        |
|-------------------------|-------------|-----------------|
| E[ΔJ] difference        | Small (~0)  | Variable        |
| Var[ΔJ] (bf16 vs fp32)  | Similar     | bf16 >> fp32    |
| KL divergence           | Constrained | Higher variance |
| bf16 sparsity           | 36-92%      | -               |

## Configuration

### Training Hyperparameters (RLVR defaults)

- Model: Qwen2.5-Math-7B
- Algorithm: DAPO (clip-only, β=0)
- Batch size: 256
- Learning rate: 1e-6
- Training steps: 300
- Rollouts per prompt: 16

### Evaluation Settings

- Temperature: 0.7
- Top-p: 0.8
- Max generation length: 2048-4096 (task-dependent)

## Citation

If you use this code, please cite the RLVR paper:

```bibtex
@article{rlvr2024,
  title={Reinforcement Learning with Verifiable Rewards},
  author={...},
  year={2024}
}
```

## License

MIT License