
Error when running QuickStart run_ppo_hotpotqa.sh #68

@TDY-raedae

Description

When running the QuickStart example run_ppo_hotpotqa.sh, I get the error below. Could someone point out what is going wrong?
2025-09-10 16:35:45,628 INFO worker.py:1942 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(raylet) The node with node id: b01282dc336e43db3b88340e4a2ba6076b51b0e0cf71b4a594b446f1 and address: 10.39.2.46 and node name: 10.39.2.46 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.
Error executing job with overrides: ['data.train_files=[data/hotpotqa/train.parquet]', 'data.val_files=[data/hotpotqa/validation.parquet]', 'data.train_batch_size=128', 'data.max_prompt_length=8192', 'data.max_response_length=8192', 'data.max_response_length_single_turn=1024', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.stop_token_ids=[151658]', 'actor_rollout_ref.rollout.stop=[]', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'critic.optim.lr=1e-5', 'critic.model.use_remove_padding=True', 'critic.model.path=Qwen/Qwen2.5-1.5B-Instruct', 'critic.model.enable_gradient_checkpointing=True', 'critic.ppo_micro_batch_size_per_gpu=2', 'critic.model.fsdp_config.param_offload=False', 'critic.model.fsdp_config.optimizer_offload=False', 'algorithm.adv_estimator=gae', 'algorithm.kl_ctrl.kl_coef=0.001', 'algorithm.use_kl_in_reward=True', 'trainer.critic_warmup=5', 'trainer.logger=[console,wandb]', 'trainer.project_name=hotpotqa', 'trainer.experiment_name=ppo-qwen2.5-1.5b-instruct', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=10', 'trainer.total_epochs=1', 'trainer.val_before_train=True', 'trainer.log_val_generations=0', 'tool.max_turns=5', 'tool.tools=[search]', 'tool.max_tool_response_length=2048']
(raylet) [2025-09-10 16:35:54,891 E 57558 57666] (raylet) agent_manager.cc:86: The raylet exited immediately because one Ray agent failed, agent_name = runtime_env_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version: `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log: `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
Traceback (most recent call last):
File "/public/home/shenninggroup/yjyan/Agent-R1/agent_r1/src/main_agent.py", line 67, in main
run_agent(config)
File "/public/home/shenninggroup/yjyan/Agent-R1/agent_r1/src/main_agent.py", line 79, in run_agent
ray.get(runner.run.remote(config))
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2882, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 970, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TaskRunner
actor_id: e8f2b574acb68373e54955f201000000
namespace: bc077af0-6205-4b63-a529-5919c27bcae2
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.39.2.46 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
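
For reference, the raylet message above already names the checks to run on the failing node. Below is a minimal diagnostic sketch: the grpcio and agent-log commands are quoted from the log itself, while the dmesg check for an OS/OOM kill and the HYDRA_FULL_ERROR re-run of the script are assumptions on my part, not confirmed steps from the Agent-R1 docs.

```bash
# Check the installed grpcio version; an incompatible grpcio can segfault the Ray agent
pip freeze | grep grpcio

# Read the agent logs from the latest Ray session (paths taken from the raylet message)
cat /tmp/ray/session_latest/logs/runtime_env_agent.log
cat /tmp/ray/session_latest/logs/dashboard_agent.log

# Assumption: check whether the OS killed the agent (e.g. OOM); may require root
dmesg | grep -i -E "out of memory|killed process"

# Re-run with a full Hydra stack trace, as the last line of the error suggests
HYDRA_FULL_ERROR=1 bash run_ppo_hotpotqa.sh
```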
