Diffstat (limited to 'logs')
| -rw-r--r-- | logs/grpo_reflection_15498033.err | 66 |
| -rw-r--r-- | logs/grpo_reflection_15498033.out | 83 |
2 files changed, 149 insertions, 0 deletions
diff --git a/logs/grpo_reflection_15498033.err b/logs/grpo_reflection_15498033.err
new file mode 100644
index 0000000..1434e29
--- /dev/null
+++ b/logs/grpo_reflection_15498033.err
@@ -0,0 +1,66 @@
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
+  warnings.warn(
+(APIServer pid=1914169) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+(APIServer pid=1914169) Parse safetensors files: 100%|██████████| 9/9 [00:00<00:00, 49.13it/s]
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
+  warnings.warn(
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
+  warnings.warn(
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
+  warnings.warn(
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 0% Completed | 0/9 [00:00<?, ?it/s]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 11% Completed | 1/9 [00:01<00:13, 1.67s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 22% Completed | 2/9 [00:03<00:11, 1.71s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 33% Completed | 3/9 [00:04<00:08, 1.48s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 44% Completed | 4/9 [00:06<00:07, 1.57s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 56% Completed | 5/9 [00:08<00:06, 1.63s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 67% Completed | 6/9 [00:09<00:04, 1.67s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 78% Completed | 7/9 [00:11<00:03, 1.65s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 89% Completed | 8/9 [00:12<00:01, 1.31s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:13<00:00, 1.43s/it]
+(Worker_TP0 pid=1914626) Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:13<00:00, 1.52s/it]
+(Worker_TP0 pid=1914626)
+(Worker_TP0 pid=1914626) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:07<00:00, 6.80it/s]
+(Worker_TP0 pid=1914626) Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00, 9.12it/s]
+(APIServer pid=1914169) INFO: Started server process [1914169]
+(APIServer pid=1914169) INFO: Waiting for application startup.
+(APIServer pid=1914169) INFO: Application startup complete.
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
+  warnings.warn(
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/import_utils.py:91: UserWarning: TRL currently only supports vLLM version `0.10.2`. You have version 0.13.0 installed. We recommend to install this version to avoid compatibility issues.
+  warnings.warn(
+/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/import_utils.py:91: UserWarning: TRL currently only supports vLLM version `0.10.2`. You have version 0.13.0 installed. We recommend to install this version to avoid compatibility issues.
+  warnings.warn(
+Traceback (most recent call last):
+  File "/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/import_utils.py", line 156, in _get_module
+    return importlib.import_module("." + module_name, self.__name__)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/u/yurenh2/miniforge3/envs/eval/lib/python3.11/importlib/__init__.py", line 126, in import_module
+    return _bootstrap._gcd_import(name[level:], package, level)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
+  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
+  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
+  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
+  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
+  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
+  File "/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 85, in <module>
+    from vllm.sampling_params import GuidedDecodingParams
+ImportError: cannot import name 'GuidedDecodingParams' from 'vllm.sampling_params' (/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/vllm/sampling_params.py)
+
+The above exception was the direct cause of the following exception:
+
+Traceback (most recent call last):
+  File "/projects/bfqt/users/yurenh2/ml-projects/personalization-user-model/collaborativeagents/training/train_grpo.py", line 21, in <module>
+    from trl import GRPOConfig, GRPOTrainer
+  File "<frozen importlib._bootstrap>", line 1229, in _handle_fromlist
+  File "/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/import_utils.py", line 147, in __getattr__
+    value = getattr(module, name)
+            ^^^^^^^^^^^^^^^^^^^^^
+  File "/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/import_utils.py", line 146, in __getattr__
+    module = self._get_module(self._class_to_module[name])
+             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/trl/import_utils.py", line 158, in _get_module
+    raise RuntimeError(
+RuntimeError: Failed to import trl.trainer.grpo_trainer because of the following error (look up to see its traceback):
+cannot import name 'GuidedDecodingParams' from 'vllm.sampling_params' (/u/yurenh2/miniforge3/envs/eval/lib/python3.11/site-packages/vllm/sampling_params.py)
diff --git a/logs/grpo_reflection_15498033.out b/logs/grpo_reflection_15498033.out
new file mode 100644
index 0000000..92e7b2b
--- /dev/null
+++ b/logs/grpo_reflection_15498033.out
@@ -0,0 +1,83 @@
+=== Starting vLLM judge server ===
+vLLM server PID: 1914169
+Waiting for vLLM server to start...
+(APIServer pid=1914169) INFO 01-11 15:24:10 [api_server.py:1351] vLLM API server version 0.13.0
+(APIServer pid=1914169) INFO 01-11 15:24:10 [utils.py:253] non-default args: {'model': 'hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', 'trust_remote_code': True, 'max_model_len': 8192, 'tensor_parallel_size': 2}
+(APIServer pid=1914169) INFO 01-11 15:24:13 [model.py:514] Resolved architecture: LlamaForCausalLM
+(APIServer pid=1914169) INFO 01-11 15:24:13 [model.py:1661] Using max model len 8192
+(APIServer pid=1914169) INFO 01-11 15:24:15 [awq_marlin.py:162] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
+(APIServer pid=1914169) INFO 01-11 15:24:15 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=2048.
+(EngineCore_DP0 pid=1914581) INFO 01-11 15:24:30 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', speculative_config=None, tokenizer='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
+(EngineCore_DP0 pid=1914581) WARNING 01-11 15:24:30 [multiproc_executor.py:882] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
+INFO 01-11 15:24:42 [parallel_state.py:1203] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:49495 backend=nccl
+INFO 01-11 15:24:42 [parallel_state.py:1203] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:49495 backend=nccl
+INFO 01-11 15:24:42 [pynccl.py:111] vLLM is using nccl==2.27.5
+WARNING 01-11 15:24:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.
+WARNING 01-11 15:24:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.
+INFO 01-11 15:24:43 [parallel_state.py:1411] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
+INFO 01-11 15:24:43 [parallel_state.py:1411] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
+(Worker_TP0 pid=1914626) INFO 01-11 15:24:44 [gpu_model_runner.py:3562] Starting to load model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4...
+(Worker_TP0 pid=1914626) INFO 01-11 15:24:45 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
+(Worker_TP0 pid=1914626) INFO 01-11 15:25:34 [weight_utils.py:487] Time spent downloading weights for hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4: 48.183306 seconds
+(Worker_TP0 pid=1914626) INFO 01-11 15:25:47 [default_loader.py:308] Loading weights took 13.80 seconds
+(Worker_TP0 pid=1914626) INFO 01-11 15:25:49 [gpu_model_runner.py:3659] Model loading took 18.5766 GiB memory and 64.704730 seconds
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:11 [backends.py:643] Using cache directory: /u/yurenh2/.cache/vllm/torch_compile_cache/6437ac94ed/rank_0_0/backbone for vLLM's torch.compile
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:11 [backends.py:703] Dynamo bytecode transform time: 20.38 s
+(Worker_TP1 pid=1914627) INFO 01-11 15:26:25 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:25 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:34 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 12.21 s
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:34 [monitor.py:34] torch.compile takes 32.59 s in total
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:35 [gpu_worker.py:375] Available KV cache memory: 15.70 GiB
+(EngineCore_DP0 pid=1914581) INFO 01-11 15:26:35 [kv_cache_utils.py:1291] GPU KV cache size: 102,896 tokens
+(EngineCore_DP0 pid=1914581) INFO 01-11 15:26:35 [kv_cache_utils.py:1296] Maximum concurrency for 8,192 tokens per request: 12.56x
+(Worker_TP1 pid=1914627) INFO 01-11 15:26:47 [custom_all_reduce.py:216] Registering 13685 cuda graph addresses
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:47 [custom_all_reduce.py:216] Registering 13685 cuda graph addresses
+(Worker_TP0 pid=1914626) INFO 01-11 15:26:48 [gpu_model_runner.py:4587] Graph capturing finished in 12 secs, took 1.40 GiB
+(EngineCore_DP0 pid=1914581) INFO 01-11 15:26:48 [core.py:259] init engine (profile, create kv cache, warmup model) took 57.57 seconds
+(APIServer pid=1914169) INFO 01-11 15:26:49 [api_server.py:1099] Supported tasks: ['generate']
+(APIServer pid=1914169) WARNING 01-11 15:26:49 [model.py:1487] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
+(APIServer pid=1914169) INFO 01-11 15:26:49 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
+(APIServer pid=1914169) INFO 01-11 15:26:49 [serving_chat.py:137] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
+(APIServer pid=1914169) INFO 01-11 15:26:49 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
+(APIServer pid=1914169) INFO 01-11 15:26:50 [serving_chat.py:137] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
+(APIServer pid=1914169) INFO 01-11 15:26:50 [api_server.py:1425] Starting vLLM API server 0 on http://0.0.0.0:8000
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:38] Available routes are:
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /tokenize, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /detokenize, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /pause, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /resume, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /is_paused, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /metrics, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /health, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /load, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/models, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /version, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/responses, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/messages, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/completions, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /ping, Methods: GET
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /ping, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /invocations, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /classify, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /score, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/score, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /rerank, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
+(APIServer pid=1914169) INFO 01-11 15:26:50 [launcher.py:46] Route: /pooling, Methods: POST
+(APIServer pid=1914169) INFO: 127.0.0.1:57116 - "GET /health HTTP/1.1" 200 OK
+vLLM server is ready!
+=== Starting GRPO training ===
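Note on the failure recorded in the .err log: the judge server starts and answers `GET /health`, but `train_grpo.py` then stops at `from trl import GRPOConfig, GRPOTrainer`, because the installed TRL imports `GuidedDecodingParams` from `vllm.sampling_params` and the installed vLLM 0.13.0 does not provide that name. The TRL warning in the same log names `0.10.2` as the supported vLLM version. A minimal preflight sketch of how this mismatch could be caught before the server is launched, assuming it runs in the same Python environment; the helper names `installed_version`, `has_symbol`, and `preflight` are hypothetical and not part of TRL, vLLM, or this repository:

```python
import importlib
from importlib import metadata

# Supported version as quoted by the TRL UserWarning in the log above.
# This is an assumption tied to that specific TRL install, not a general rule.
SUPPORTED_VLLM = "0.10.2"


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_symbol(module_name, symbol):
    """Return True if module_name imports cleanly and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


def preflight(vllm_version, supported=SUPPORTED_VLLM):
    """Return a list of human-readable problems (empty means no mismatch found)."""
    problems = []
    if vllm_version is None:
        problems.append("vllm is not installed")
    elif vllm_version != supported:
        problems.append(f"vllm {vllm_version} installed, TRL expects {supported}")
    return problems


# Hypothetical usage before launching the judge server:
#   issues = preflight(installed_version("vllm"))
#   if not has_symbol("vllm.sampling_params", "GuidedDecodingParams"):
#       issues.append("vllm.sampling_params lacks GuidedDecodingParams")
#   if issues: report them and exit instead of starting the server
```

Running such a check first would avoid spending the roughly two and a half minutes of model download, load, and graph capture shown in the .out log on a job whose training step cannot be imported.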
