Diffstat (limited to 'README.md')
-rw-r--r--  README.md  196
1 file changed, 196 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..f1695f6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,196 @@
+# RLVR Floating-Point Precision Experiment
+
+This repository contains experiments studying the effect of floating-point precision (FP32 vs. bf16) on RLVR (Reinforcement Learning with Verifiable Rewards) training.
+
+## Overview
+
+The experiment aims to verify three key hypotheses based on the RLVR Three-Gate Theory:
+
+1. **On-task performance is insensitive to floating-point noise**: Training task performance should be similar between FP32 and bf16 precision modes.
+
+2. **Off-task performance is sensitive to floating-point noise**: Out-of-distribution task performance should show higher variance in bf16 mode due to numerical noise accumulation.
+
+3. **KL divergence patterns differ by task type**: On-task KL should be constrained by DAPO's implicit leash (Gate I), while off-task KL may drift more in bf16 mode.
+
+## Project Structure
+
+```
+rl-floating-noise/
+├── config.py                     # Configuration definitions
+├── train_rlvr.py                 # Training script with DAPO algorithm
+├── eval_policy.py                # Evaluation script (J_k, KL_k)
+├── utils_math_eval.py            # Math answer verification utilities
+├── utils_kl.py                   # KL divergence computation utilities
+├── utils_bf16_sparsity.py        # bf16-aware update sparsity analysis
+├── run_experiments.py            # Experiment orchestration script
+├── analyze_results.py            # Results analysis and hypothesis testing
+├── requirements.txt              # Python dependencies
+├── configs/
+│   └── eval_tasks_config.json    # Evaluation task configurations
+├── scripts/
+│   ├── prepare_data.py           # Dataset preparation script
+│   ├── run_training.sh           # Single training job script
+│   ├── run_evaluation.sh         # Single evaluation job script
+│   └── run_full_experiment.sh    # Full experiment pipeline
+├── data/                         # Training and evaluation datasets
+└── results/                      # Experiment outputs
+    ├── train_logs/               # Training checkpoints and logs
+    ├── eval_metrics/             # Evaluation results
+    └── analysis/                 # Analysis outputs and plots
+```
+
+## Installation
+
+```bash
+# Create conda environment
+conda create -n rlvr-fp python=3.10 -y
+conda activate rlvr-fp
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Install VeRL (optional, for full DAPO implementation)
+pip install git+https://github.com/volcengine/verl.git
+```
+
+## Quick Start
+
+### 1. Prepare Data
+
+Generate sample datasets for development:
+
+```bash
+python scripts/prepare_data.py --output_dir ./data
+```
+
+For production experiments, download the actual datasets:
+- DM (DAPO-Math-17k + MATH)
+- AIME24, AIME25, AMC23, MATH-500
+- GSM8K, MMLU-STEM, HumanEval
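+
+The on-disk format is defined by `scripts/prepare_data.py`; as a rough, non-authoritative illustration (the field names below are assumptions, not the script's actual schema), each training record pairs a prompt with a verifiable answer:
+
+```python
+# Illustrative only: the real schema is whatever scripts/prepare_data.py emits.
+import json
+import os
+
+sample_records = [
+    {
+        "prompt": "What is 3 + 5?",  # math problem posed to the policy
+        "answer": "8",               # ground-truth answer used by the verifier
+        "source": "dapo_math_17k",   # hypothetical provenance tag
+    }
+]
+
+os.makedirs("data", exist_ok=True)
+with open("data/dm_train_sample.json", "w") as f:
+    json.dump(sample_records, f, indent=2)
+```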
+
+### 2. Run Single Training Job
+
+```bash
+# Train with bf16 precision
+python train_rlvr.py \
+ --precision_mode bf16 \
+ --seed 1 \
+ --output_dir results/train_logs/bf16_seed1 \
+ --train_dataset_path data/dm_train.json
+
+# Train with fp32 precision
+python train_rlvr.py \
+ --precision_mode fp32 \
+ --seed 1 \
+ --output_dir results/train_logs/fp32_seed1 \
+ --train_dataset_path data/dm_train.json
+```
+
+### 3. Run Evaluation
+
+```bash
+python eval_policy.py \
+ --base_ckpt Qwen/Qwen2.5-Math-7B \
+ --ft_ckpt results/train_logs/bf16_seed1/final_model \
+ --eval_tasks_config configs/eval_tasks_config.json \
+ --output_path results/eval_metrics/bf16_seed1.json \
+ --eval_base
+```
+
+### 4. Run Full Experiment
+
+```bash
+# Run complete experiment pipeline
+bash scripts/run_full_experiment.sh
+
+# Or use Python orchestrator
+python run_experiments.py --mode full --seeds 1 2 3 4 5
+```
+
+### 5. Analyze Results
+
+```bash
+python analyze_results.py \
+ --results_dir results/eval_metrics \
+ --output_dir results/analysis \
+ --on_task dm_val \
+ --off_task aime24 aime25 amc23 math500 mmlu_stem humaneval
+```
+
+## Precision Configurations
+
+### P-high (FP32)
+- Master weights stored in FP32
+- Deterministic algorithms enabled
+- Dropout disabled
+- Minimal numerical noise
+
+### P-bf16 (Default RLVR)
+- Master weights stored in bf16
+- Non-deterministic algorithms
+- Dropout enabled
+- Higher numerical noise (Gate III effects)
+
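+As a rough sketch of how these two modes could be realized in PyTorch (the authoritative settings live in `config.py` and `train_rlvr.py`; everything below is illustrative):
+
+```python
+# Illustrative sketch only; names and structure are assumptions.
+import torch
+from transformers import AutoModelForCausalLM
+
+def load_policy(precision_mode: str = "bf16"):
+    if precision_mode == "fp32":
+        # P-high: deterministic kernels (may require CUBLAS_WORKSPACE_CONFIG)
+        torch.use_deterministic_algorithms(True)
+        torch.backends.cudnn.benchmark = False
+        dtype = torch.float32                  # FP32 master weights
+    else:
+        dtype = torch.bfloat16                 # bf16 master weights, noisier updates
+
+    model = AutoModelForCausalLM.from_pretrained(
+        "Qwen/Qwen2.5-Math-7B", torch_dtype=dtype
+    )
+    if precision_mode == "fp32":
+        # P-high also disables dropout
+        for m in model.modules():
+            if isinstance(m, torch.nn.Dropout):
+                m.p = 0.0
+    return model
+```
+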
+## Key Metrics
+
+### Performance (J_k)
+- Pass@1 accuracy for verifiable tasks
+- Computed via Monte Carlo sampling
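+
+A minimal sketch of the Monte Carlo pass@1 estimate (illustrative only; `eval_policy.py` holds the actual implementation):
+
+```python
+# pass@1 = mean over prompts of the fraction of sampled completions judged correct.
+def pass_at_1(prompt_results):
+    """prompt_results: list of per-prompt lists of booleans (one bool per sample)."""
+    per_prompt = [sum(samples) / len(samples) for samples in prompt_results]
+    return sum(per_prompt) / len(per_prompt)
+
+# Example: 2 prompts, 4 samples each -> 0.5
+print(pass_at_1([[True, True, False, True], [False, False, True, False]]))
+```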
+
+### Performance Delta (ΔJ_k)
+```
+ΔJ_k = J_k(θ_T) - J_k(θ_0)
+```
+
+### KL Divergence
+```
+KL_k ≈ E_x E_y~π_θ [log π_θ(y|x) - log π_0(y|x)]
+```
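+
+A minimal Monte Carlo estimator of this quantity, assuming per-sequence log-probabilities under both policies are already available (`utils_kl.py` holds the actual implementation):
+
+```python
+# KL_k is estimated from samples y ~ pi_theta as the mean of
+# log pi_theta(y|x) - log pi_0(y|x) over sampled completions.
+import numpy as np
+
+def estimate_kl(logp_theta, logp_base):
+    """Both arguments: per-sequence log-probabilities, shape (num_samples,)."""
+    return float(np.mean(np.asarray(logp_theta) - np.asarray(logp_base)))
+```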
+
+### bf16 Update Sparsity
+```
+sparsity = 1 - |{i: |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
+```
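+
+A sketch of this measurement, assuming the before/after weights are available as tensors (`utils_bf16_sparsity.py` is authoritative):
+
+```python
+# Fraction of parameters whose relative change stays below the eta threshold.
+import torch
+
+def update_sparsity(w, w_prime, eta=1e-8):
+    diff = (w - w_prime).abs()
+    scale = torch.maximum(w.abs(), w_prime.abs())
+    changed = (diff > eta * scale).sum().item()
+    return 1.0 - changed / w.numel()
+```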
+
+## Expected Results
+
+Based on RLVR theory predictions:
+
+| Metric | On-task | Off-task |
+|--------|---------|----------|
+| E[ΔJ] difference | Small (~0) | Variable |
+| Var[ΔJ] (bf16 vs fp32) | Similar | bf16 >> fp32 |
+| KL divergence | Constrained | Higher variance |
+| bf16 sparsity | 36-92% | - |
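+
+As an illustration of how the variance prediction could be checked (the actual tests live in `analyze_results.py`; the ΔJ values below are made up), one option is a Levene test on per-seed ΔJ:
+
+```python
+# Hypothetical per-seed off-task dJ values; tests equality of variances.
+from scipy.stats import levene
+
+delta_j_bf16 = [0.02, -0.05, 0.08, -0.11, 0.04]
+delta_j_fp32 = [0.01, 0.00, 0.02, -0.01, 0.01]
+
+stat, p_value = levene(delta_j_bf16, delta_j_fp32)
+print(f"Levene statistic={stat:.3f}, p={p_value:.3f}")
+```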
+
+## Configuration
+
+### Training Hyperparameters (RLVR defaults)
+- Model: Qwen2.5-Math-7B
+- Algorithm: DAPO (clip-only, β=0)
+- Batch size: 256
+- Learning rate: 1e-6
+- Training steps: 300
+- Rollouts per prompt: 16
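+
+As an illustration of how these defaults might be grouped (the real definitions live in `config.py`; the field names below are assumptions):
+
+```python
+# Illustrative only; not the actual config.py contents.
+from dataclasses import dataclass
+
+@dataclass
+class TrainConfig:
+    model_name: str = "Qwen/Qwen2.5-Math-7B"
+    algorithm: str = "dapo"          # clip-only, KL coefficient beta = 0
+    batch_size: int = 256
+    learning_rate: float = 1e-6
+    total_steps: int = 300
+    rollouts_per_prompt: int = 16
+    precision_mode: str = "bf16"     # or "fp32"
+    seed: int = 1
+```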
+
+### Evaluation Settings
+- Temperature: 0.7
+- Top-p: 0.8
+- Max generation length: 2048-4096 (task-dependent)
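+
+As a sketch of how these settings map onto a Hugging Face `GenerationConfig` (whether `eval_policy.py` samples via transformers or another backend is an assumption here):
+
+```python
+# Illustrative sampling configuration for evaluation.
+from transformers import GenerationConfig
+
+eval_generation = GenerationConfig(
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.8,
+    max_new_tokens=4096,   # use 2048 for shorter tasks, per the task config
+)
+```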
+
+## Citation
+
+If you use this code, please cite the RLVR paper:
+
+```bibtex
+@article{rlvr2024,
+ title={Reinforcement Learning with Verifiable Rewards},
+ author={...},
+ year={2024}
+}
+```
+
+## License
+
+MIT License
+