PPO training fails with an NCCL error on a multi-GPU setup because peer access between the devices is not supported.
(WorkerDict pid=565043) /home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
(WorkerDict pid=565043) warnings.warn( # warn only once
(WorkerDict pid=564401) [rank0]:[W831 14:46:12.241792627 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/mnt/yixiali/CODES/AgentFly/data/rlhf/math//orz_math_57k_train.json', 'data.val_files=/mnt/yixiali/CODES/AgentFly/data/rlhf/math//MATH_500.json', 'data.train_batch_size=64', 'agent.agent_type=code', 'agent.tools=[code_interpreter]', 'agent.template=qwen2.5-no-system-tool', 'agent.model_name_or_path=Qwen/Qwen2.5-3B-Instruct', 'agent.max_turns=8', 'agent.backend=async_verl', 'agent.reward_name=math_reward_tool', 'agent.num_chains=8', 'agent.use_agent=True', 'actor_rollout_ref.actor.optim.lr=5e-7', 'actor_rollout_ref.model.use_remove_padding=False', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=mse', 'actor_rollout_ref.actor.entropy_coeff=0.001', 'actor_rollout_ref.model.enable_gradient_checkpointing=False', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.response_length=512', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'critic.model.path=Qwen/Qwen2.5-3B-Instruct', 'critic.ppo_mini_batch_size=64', 'critic.ppo_micro_batch_size_per_gpu=2', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=AgentRL', 'trainer.experiment_name=test', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=50', 'trainer.test_freq=10', 'trainer.total_training_steps=200', 'trainer.val_before_train=False']
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/aiscuser/CODES/AgentFly/verl/verl/trainer/main_ppo.py", line 244, in <module>
main()
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/aiscuser/CODES/AgentFly/verl/verl/trainer/main_ppo.py", line 62, in main
run_ppo(config)
File "/home/aiscuser/CODES/AgentFly/verl/verl/trainer/main_ppo.py", line 74, in run_ppo
ray.get(runner.run.remote(config))
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/ray/_private/worker.py", line 2882, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/ray/_private/worker.py", line 968, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::TaskRunner.run() (pid=560810, ip=100.64.24.39, actor_id=e8bf18909706ae5475513a4701000000, repr=<main_ppo.TaskRunner object at 0x756c19ba2c80>)
File "/home/aiscuser/CODES/AgentFly/verl/verl/trainer/main_ppo.py", line 180, in run
trainer.init_workers()
File "/home/aiscuser/CODES/AgentFly/verl/verl/trainer/ppo/ray_trainer.py", line 746, in init_workers
self.ref_policy_wg.init_model()
File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/ray/base.py", line 49, in func
output = ray.get(output)
ray.exceptions.RayTaskError(DistBackendError): ray::WorkerDict.ref_init_model() (pid=565045, ip=100.64.24.39, actor_id=7cc9f5a6383c4734a8d1c4f501000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x749e8269b040>)
File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/ray/base.py", line 466, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/base/decorator.py", line 501, in inner
return func(*args, **kwargs)
File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 521, in init_model
self.ref_module_fsdp = self._build_model_optimizer(
File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 233, in _build_model_optimizer
torch.distributed.barrier()
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4635, in barrier
work = group.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
Cleaning up environments...
0it [00:00, ?it/s]
(TaskRunner pid=560810) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.ref_init_model() (pid=565044, ip=100.64.24.39, actor_id=0f341df54875fc057ed0c0bd01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7d703320b040>)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/ray/base.py", line 466, in func
(TaskRunner pid=560810) return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/base/decorator.py", line 501, in inner
(TaskRunner pid=560810) return func(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 521, in init_model
(TaskRunner pid=560810) self.ref_module_fsdp = self._build_model_optimizer(
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 233, in _build_model_optimizer
(TaskRunner pid=560810) torch.distributed.barrier()
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(TaskRunner pid=560810) return func(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4635, in barrier
(TaskRunner pid=560810) work = group.barrier(opts=opts)
(TaskRunner pid=560810) torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
(TaskRunner pid=560810) ncclUnhandledCudaError: Call to CUDA function failed.
(TaskRunner pid=560810) Last error:
(TaskRunner pid=560810) Cuda failure 217 'peer access is not supported between these two devices'
(TaskRunner pid=560810) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.ref_init_model() (pid=564401, ip=100.64.24.39, actor_id=89558e3d0536a122e39c2a7e01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x74fa36f1f0a0>)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/ray/base.py", line 466, in func
(TaskRunner pid=560810) return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/base/decorator.py", line 501, in inner
(TaskRunner pid=560810) return func(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 521, in init_model
(TaskRunner pid=560810) self.ref_module_fsdp = self._build_model_optimizer(
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 233, in _build_model_optimizer
(TaskRunner pid=560810) torch.distributed.barrier()
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(TaskRunner pid=560810) return func(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4635, in barrier
(TaskRunner pid=560810) work = group.barrier(opts=opts)
(TaskRunner pid=560810) torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
(TaskRunner pid=560810) ncclUnhandledCudaError: Call to CUDA function failed.
(TaskRunner pid=560810) Last error:
(TaskRunner pid=560810) Cuda failure 217 'peer access is not supported between these two devices'
(TaskRunner pid=560810) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.ref_init_model() (pid=565043, ip=100.64.24.39, actor_id=82614e5081ad45519bc4c3f201000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x70727f5d6f20>)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/ray/base.py", line 466, in func
(TaskRunner pid=560810) return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/single_controller/base/decorator.py", line 501, in inner
(TaskRunner pid=560810) return func(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 521, in init_model
(TaskRunner pid=560810) self.ref_module_fsdp = self._build_model_optimizer(
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/verl/verl/workers/fsdp_workers.py", line 233, in _build_model_optimizer
(TaskRunner pid=560810) torch.distributed.barrier()
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(TaskRunner pid=560810) return func(*args, **kwargs)
(TaskRunner pid=560810) File "/home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4635, in barrier
(TaskRunner pid=560810) work = group.barrier(opts=opts)
(TaskRunner pid=560810) torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
(TaskRunner pid=560810) ncclUnhandledCudaError: Call to CUDA function failed.
(TaskRunner pid=560810) Last error:
(TaskRunner pid=560810) Cuda failure 217 'peer access is not supported between these two devices'
(WorkerDict pid=565044) [W831 14:46:10.832875108 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator()) [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WorkerDict pid=565044) [W831 14:46:10.832074897 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [100-64-24-39.proxy-node-0.79e5d84c-257e-473d-ad40-0b89b73e0ad7.svc.cluster.local]:50601 (errno: 97 - Address family not supported by protocol).
(WorkerDict pid=564401) `torch_dtype` is deprecated! Use `dtype` instead! [repeated 3x across cluster]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s] [repeated 3x across cluster]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 16.22it/s] [repeated 3x across cluster]
(WorkerDict pid=564401) /home/aiscuser/CODES/AgentFly/uv_agentfly/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. [repeated 3x across cluster]
(WorkerDict pid=564401) warnings.warn( # warn only once [repeated 3x across cluster]
(WorkerDict pid=565044) [rank2]:[W831 14:46:12.312774421 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 2] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. [repeated 3x across cluster]
NCCL P2P Communication Error During Multi-GPU Training

Problem

PPO training fails with an NCCL error on a multi-GPU setup because peer access between the devices is not supported.

Error Details

See the full log above. The error is raised from torch.distributed.barrier() while initializing the reference-policy model (verl/workers/fsdp_workers.py, line 233 in _build_model_optimizer, reached from init_model at line 521).

Root Cause

ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'

Attempted Solutions
Environment variables tried but failed:
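The specific variables tried in this run are not listed above; for reference, the toggles most commonly attempted for this failure are below (illustrative, not necessarily the set used here). `NCCL_P2P_DISABLE=1` makes NCCL stage transfers through host memory instead of direct peer access, which typically works around unsupported-P2P topologies at some bandwidth cost:

```shell
# Common NCCL settings for 'peer access is not supported' (illustrative;
# not necessarily the exact variables tried in this run):
export NCCL_P2P_DISABLE=1   # skip direct GPU peer-to-peer; stage via host memory
export NCCL_SHM_DISABLE=1   # optionally also disable the shared-memory transport
export NCCL_DEBUG=INFO      # verbose logging to confirm which transport is chosen
```

If disabling P2P resolves the crash, the underlying cause is likely the node's GPU topology (or a virtualization layer blocking peer mappings) rather than verl or PyTorch itself.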
References
Environment Setup
(From the overrides above: single node, 4 GPUs per node, Ray-launched workers.)
Package Versions
(From the log above: Python 3.10, NCCL 2.26.2; the remaining package versions were not captured.)
Training Script
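The launch line itself is not reproduced in this section, but it can be reconstructed from the Hydra overrides in the error output above; the `python -m` form is inferred from the runpy frames in the traceback, and the wrapping script, if any, is unknown:

```shell
# Reconstructed from the logged Hydra overrides (abridged; the full list
# appears verbatim in the error output above).
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=64 \
    agent.agent_type=code \
    agent.model_name_or_path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    # ...remaining overrides as listed above
</pre>
```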