| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-04 18:59:35 -0600 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-02-04 18:59:35 -0600 |
| commit | f1c2cc22d46a6976df3555391e667c7e61592fad (patch) | |
| tree | 0b37b52c8ff91042a742d3b3ec54542cb6d6e2f6 /README.md | |
Diffstat (limited to 'README.md')
| -rw-r--r-- | README.md | 196 |
1 file changed, 196 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..f1695f6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,196 @@

# RLVR Floating-Point Precision Experiment

This repository implements experiments to study the effects of floating-point precision (FP32 vs bf16) on RLVR (Reinforcement Learning with Verifiable Rewards) training.

## Overview

The experiment aims to verify three key hypotheses based on the RLVR Three-Gate Theory:

1. **On-task performance is insensitive to floating-point noise**: Training task performance should be similar between FP32 and bf16 precision modes.

2. **Off-task performance is sensitive to floating-point noise**: Out-of-distribution task performance should show higher variance in bf16 mode due to numerical noise accumulation.

3. **KL divergence patterns differ by task type**: On-task KL should be constrained by DAPO's implicit leash (Gate I), while off-task KL may drift more in bf16 mode.

## Project Structure

```
rl-floating-noise/
├── config.py                 # Configuration definitions
├── train_rlvr.py             # Training script with DAPO algorithm
├── eval_policy.py            # Evaluation script (J_k, KL_k)
├── utils_math_eval.py        # Math answer verification utilities
├── utils_kl.py               # KL divergence computation utilities
├── utils_bf16_sparsity.py    # bf16-aware update sparsity analysis
├── run_experiments.py        # Experiment orchestration script
├── analyze_results.py        # Results analysis and hypothesis testing
├── requirements.txt          # Python dependencies
├── configs/
│   └── eval_tasks_config.json  # Evaluation task configurations
├── scripts/
│   ├── prepare_data.py         # Dataset preparation script
│   ├── run_training.sh         # Single training job script
│   ├── run_evaluation.sh       # Single evaluation job script
│   └── run_full_experiment.sh  # Full experiment pipeline
├── data/                     # Training and evaluation datasets
└── results/                  # Experiment outputs
    ├── train_logs/           # Training checkpoints and logs
    ├── eval_metrics/         # Evaluation results
    └── analysis/             # Analysis outputs and plots
```

## Installation

```bash
# Create conda environment
conda create -n rlvr-fp python=3.10 -y
conda activate rlvr-fp

# Install dependencies
pip install -r requirements.txt

# Install VeRL (optional, for full DAPO implementation)
pip install git+https://github.com/volcengine/verl.git
```

## Quick Start

### 1. Prepare Data

Generate sample datasets for development:

```bash
python scripts/prepare_data.py --output_dir ./data
```

For production experiments, download the actual datasets:
- DM (DAPO-Math-17k + MATH)
- AIME24, AIME25, AMC23, MATH-500
- GSM8K, MMLU-STEM, HumanEval

### 2. Run Single Training Job

```bash
# Train with bf16 precision
python train_rlvr.py \
    --precision_mode bf16 \
    --seed 1 \
    --output_dir results/train_logs/bf16_seed1 \
    --train_dataset_path data/dm_train.json

# Train with fp32 precision
python train_rlvr.py \
    --precision_mode fp32 \
    --seed 1 \
    --output_dir results/train_logs/fp32_seed1 \
    --train_dataset_path data/dm_train.json
```

### 3. Run Evaluation

```bash
python eval_policy.py \
    --base_ckpt Qwen/Qwen2.5-Math-7B \
    --ft_ckpt results/train_logs/bf16_seed1/final_model \
    --eval_tasks_config configs/eval_tasks_config.json \
    --output_path results/eval_metrics/bf16_seed1.json \
    --eval_base
```
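The evaluation step writes per-task metrics to the file given by `--output_path`. As a quick sanity check, something like the sketch below can summarize one run; the key names (`J_base`, `J_ft`, `KL`) and the task-keyed layout are assumptions for illustration and may not match the actual schema produced by `eval_policy.py`.

```python
# Minimal sketch: summarize one evaluation output file per task.
# NOTE: the keys "J_base", "J_ft", and "KL" are assumed for illustration;
# check the actual JSON under results/eval_metrics/ for the real field names.
import json

with open("results/eval_metrics/bf16_seed1.json") as f:
    metrics = json.load(f)

for task, entry in metrics.items():
    delta_j = entry["J_ft"] - entry["J_base"]   # ΔJ_k = J_k(θ_T) - J_k(θ_0)
    print(f"{task:12s}  ΔJ = {delta_j:+.3f}   KL ≈ {entry['KL']:.4f}")
```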
### 4. Run Full Experiment

```bash
# Run complete experiment pipeline
bash scripts/run_full_experiment.sh

# Or use Python orchestrator
python run_experiments.py --mode full --seeds 1 2 3 4 5
```

### 5. Analyze Results

```bash
python analyze_results.py \
    --results_dir results/eval_metrics \
    --output_dir results/analysis \
    --on_task dm_val \
    --off_task aime24 aime25 amc23 math500 mmlu_stem humaneval
```

## Precision Configurations

### P-high (FP32)
- Master weights stored in FP32
- Deterministic algorithms enabled
- Dropout disabled
- Minimal numerical noise

### P-bf16 (Default RLVR)
- Master weights stored in bf16
- Non-deterministic algorithms
- Dropout enabled
- Higher numerical noise (Gate III effects)

## Key Metrics

### Performance (J_k)
- Pass@1 accuracy for verifiable tasks
- Computed via Monte Carlo sampling

### Performance Delta (ΔJ_k)
```
ΔJ_k = J_k(θ_T) - J_k(θ_0)
```

### KL Divergence
```
KL_k ≈ E_x E_y~π_θ [log π_θ(y|x) - log π_0(y|x)]
```

### bf16 Update Sparsity
```
sparsity = 1 - |{i: |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
```
(A minimal sketch of this computation is included at the end of this README.)

## Expected Results

Based on RLVR theory predictions:

| Metric | On-task | Off-task |
|--------|---------|----------|
| E[ΔJ] difference | Small (~0) | Variable |
| Var[ΔJ] (bf16 vs fp32) | Similar | bf16 >> fp32 |
| KL divergence | Constrained | Higher variance |
| bf16 sparsity | 36-92% | - |

## Configuration

### Training Hyperparameters (RLVR defaults)
- Model: Qwen2.5-Math-7B
- Algorithm: DAPO (clip-only, β=0)
- Batch size: 256
- Learning rate: 1e-6
- Training steps: 300
- Rollouts per prompt: 16

### Evaluation Settings
- Temperature: 0.7
- Top-p: 0.8
- Max generation length: 2048-4096 (task-dependent)

## Citation

If you use this code, please cite the RLVR paper:

```bibtex
@article{rlvr2024,
  title={Reinforcement Learning with Verifiable Rewards},
  author={...},
  year={2024}
}
```

## License

MIT License
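For the bf16 update-sparsity metric defined under Key Metrics, the sketch below shows one way to compute it for a single parameter tensor, assuming a before/after pair of weights and a relative threshold η. The function name and signature are illustrative and not the actual `utils_bf16_sparsity.py` API.

```python
# Minimal sketch of the update-sparsity metric from the Key Metrics section:
#   sparsity = 1 - |{i : |w_i - w'_i| > η·max(|w_i|, |w'_i|)}| / n
# Function and argument names are illustrative (not the utils_bf16_sparsity.py API).
import torch

def update_sparsity(w_before: torch.Tensor, w_after: torch.Tensor, eta: float = 1e-3) -> float:
    diff = (w_after.float() - w_before.float()).abs()
    scale = torch.maximum(w_before.float().abs(), w_after.float().abs())
    changed = (diff > eta * scale).sum().item()   # coordinates that moved by more than η relative
    return 1.0 - changed / w_before.numel()

# Toy example: a tiny perturbation of a bf16 tensor leaves most coordinates
# unchanged after rounding back to bf16, so the reported sparsity is high.
w0 = torch.randn(10_000).to(torch.bfloat16)
w1 = (w0.float() + 1e-6 * torch.randn(10_000)).to(torch.bfloat16)
print(f"update sparsity: {update_sparsity(w0, w1):.2%}")
```

In practice one would apply this per tensor of the checkpoint state dict and aggregate across layers.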
