| author | YurenHao0426 <blackhao0426@gmail.com> | 2026-01-27 09:57:37 -0600 |
|---|---|---|
| committer | YurenHao0426 <blackhao0426@gmail.com> | 2026-01-27 09:57:37 -0600 |
| commit | dc801c07cf38b0c495686463e6ca6f871a64440e (patch) | |
| tree | 599f03114775921dbc472403c701f4a3a8ea188a /collaborativeagents/training/grpo_verl/run_verl_grpo.sh | |
| parent | e43b3f8aa36c198b95c1e46bea2eaf3893b13dc3 (diff) | |
Add collaborativeagents module and update gitignore
- Add collaborativeagents subproject with adapters, agents, and evaluation modules
- Update .gitignore to exclude large binary files (.whl, .tar), wandb logs, and results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Diffstat (limited to 'collaborativeagents/training/grpo_verl/run_verl_grpo.sh')
| -rw-r--r-- | collaborativeagents/training/grpo_verl/run_verl_grpo.sh | 63 |
1 files changed, 63 insertions, 0 deletions
diff --git a/collaborativeagents/training/grpo_verl/run_verl_grpo.sh b/collaborativeagents/training/grpo_verl/run_verl_grpo.sh
new file mode 100644
index 0000000..ede35ab
--- /dev/null
+++ b/collaborativeagents/training/grpo_verl/run_verl_grpo.sh
@@ -0,0 +1,63 @@
+#!/bin/bash
+export PYTHONPATH="/shared/storage-01/users/mehri2/verl:$PYTHONPATH"
+set -x
+HYDRA_FULL_ERROR=1
+
+train_data="/shared/storage-01/users/mehri2/mem/collaborativeagents/training/grpo_verl/data/session_level_reflection_grpo_train.parquet"
+model_path="/shared/storage-01/users/mehri2/LLaMA-Factory/saves/llama-3.1-8b-instruct/full/sft_session_level_reflection/checkpoint-628"
+reward_fn_path="/shared/storage-01/users/mehri2/mem/collaborativeagents/training/grpo_verl/verl_reward_functions.py"
+
+max_prompt_length=2048
+max_response_length=1024
+train_batch_size=8
+n_generations=8
+# Effective batch size is 64
+
+python3 -m verl.trainer.main_ppo \
+    algorithm.adv_estimator=grpo \
+    data.train_files="$train_data" \
+    data.val_files="$train_data" \
+    data.train_batch_size=$train_batch_size \
+    data.max_prompt_length=$max_prompt_length \
+    data.max_response_length=$max_response_length \
+    data.filter_overlong_prompts=True \
+    data.truncation='error' \
+    data.prompt_key=prompt \
+    data.reward_fn_key=data_source \
+    actor_rollout_ref.model.path=$model_path \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.model.use_remove_padding=True \
+    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
+    actor_rollout_ref.actor.use_kl_loss=True \
+    actor_rollout_ref.actor.kl_loss_coef=0.003 \
+    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
+    actor_rollout_ref.actor.entropy_coeff=0 \
+    actor_rollout_ref.model.enable_gradient_checkpointing=True \
+    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
+    actor_rollout_ref.rollout.n=$n_generations \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
+    actor_rollout_ref.ref.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    actor_rollout_ref.rollout.temperature=0.9 \
+    actor_rollout_ref.rollout.top_p=0.9 \
+    custom_reward_function.path=$reward_fn_path \
+    custom_reward_function.name=compute_score \
+    algorithm.use_kl_in_reward=False \
+    trainer.critic_warmup=0 \
+    trainer.val_before_train=False \
+    trainer.logger='["console","wandb"]' \
+    trainer.project_name='collaborative-agent-reflection-grpo' \
+    trainer.experiment_name='llama3.1-8b-verl-grpo-v3' \
+    trainer.n_gpus_per_node=4 \
+    trainer.nnodes=1 \
+    trainer.save_freq=50 \
+    trainer.test_freq=100 \
+    trainer.total_epochs=1 \
+    trainer.default_local_dir=/shared/storage-01/users/mehri2/mem/collaborativeagents/training/grpo_verl/results/v3 $@
\ No newline at end of file
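
The script's comment "Effective batch size is 64" appears to be the product of `train_batch_size` and `n_generations`: 8 prompts per step, each sampled 8 times via `actor_rollout_ref.rollout.n`. A minimal sketch of that arithmetic follows. The name `effective_batch_size` is introduced here for illustration and is not part of the script.

```shell
#!/bin/bash
# Values copied from run_verl_grpo.sh in the diff above.
train_batch_size=8   # prompts per training step (data.train_batch_size)
n_generations=8      # responses sampled per prompt (actor_rollout_ref.rollout.n)

# Hypothetical helper variable: responses generated per step
# = prompts per step x responses per prompt.
effective_batch_size=$((train_batch_size * n_generations))
echo "effective batch size: $effective_batch_size"
```

Changing either value changes the number of responses scored per step, so the comment in the script needs to be kept in sync by hand.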
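
Two shell details in the script may be worth checking. `HYDRA_FULL_ERROR=1` is assigned without `export`, so unless it is already in the environment the `python3` child process would not see it. The trailing `$@` is unquoted, so an extra override containing spaces would be split into several arguments. The sketch below demonstrates both behaviors in isolation. `DEMO_FLAG`, `count_args`, `forward_unquoted`, and `forward_quoted` are hypothetical names used only for this illustration.

```shell
#!/bin/bash
# --- 1. Plain assignment vs. export ---
unset DEMO_FLAG                 # make sure the variable is not inherited
DEMO_FLAG=1                     # shell variable only, not exported
child_unexported=$(bash -c 'echo "${DEMO_FLAG:-unset}"')

export DEMO_FLAG                # now visible to child processes
child_exported=$(bash -c 'echo "${DEMO_FLAG:-unset}"')

echo "child without export: $child_unexported"
echo "child with export:    $child_exported"

# --- 2. $@ vs. "$@" when forwarding arguments ---
count_args() { echo "$#"; }
forward_unquoted() { count_args $@; }     # word-splits each argument
forward_quoted()   { count_args "$@"; }   # preserves each argument as-is

unquoted_count=$(forward_unquoted "trainer.experiment_name=my run")
quoted_count=$(forward_quoted "trainer.experiment_name=my run")

echo "unquoted \$@ passes $unquoted_count argument(s)"
echo "quoted \"\$@\" passes $quoted_count argument(s)"
```

If the intent is for Hydra to print full stack traces and for extra overrides to be forwarded verbatim, `export HYDRA_FULL_ERROR=1` and `"$@"` would be the usual spellings.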
