Completed rollouts > total_tasks_queued #355

Description

@chenhr18thu

GPT-5 analysis:
Between 04:21 and 04:26 the log keeps showing vLLM errors of the form "maximum context length is 20480 tokens, but the request contains 22k-61k tokens", and every call returns 400. (See the repeated ValueError traces in AAAA_8mcp_QwenAgent_8B_1203test1.log.)

These oversized requests follow from the current Hydra config: data.max_prompt_length=12288 plus data.max_response_length=8192, on top of which the multi-turn hermes template, system prompt, etc. drive the actual token count far past 20480. At the same time data.truncation=error is set, so nothing is truncated or dropped and the oversized samples go straight to vLLM.
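
To make the overflow arithmetic explicit (all numbers come from the config and the error message above):

```python
# data.max_prompt_length + data.max_response_length already saturates the
# context window exactly, leaving zero headroom for template overhead:
max_prompt_length = 12288    # data.max_prompt_length
max_response_length = 8192   # data.max_response_length
max_model_len = 20480        # vLLM's maximum context length

assert max_prompt_length + max_response_length == max_model_len  # no slack at all

# Every extra token from the multi-turn hermes template and the system prompt
# pushes the request past max_model_len, and with data.truncation=error
# nothing trims it before it reaches vLLM.
```
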
After each 400, AgentLightning still records the rollout as "completed", but both reward and triplet are None (hence the many warnings in the log). The same task is then re-claimed and run again, and both the old failed result and the new successful one can linger in the server's result cache.

AgentModeDaemon._async_run_until_finished never cleans up stale results on the server side; it exits as soon as len(_completed_rollouts) >= _total_tasks_queued. So when an old failed result is fetched back during a new sampling round, the local counter can reach states like "Completed 33/32 tasks", which finally trips the assertion in get_train_data_batch. (agentlightning/verl/daemon.py)
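
A minimal simulation of the counting failure (the names mirror the issue text; this is not the actual daemon.py code):

```python
# 32 tasks are queued this round, but one stale result from the previous
# round is still sitting in the server's result cache.
completed_rollouts = {f"rollout-{i}": None for i in range(32)}
completed_rollouts["rollout-stale"] = None   # leftover from the last round
total_tasks_queued = 32

# _async_run_until_finished exits as soon as completions reach the queue
# size, so 33 completions still count as "finished":
assert len(completed_rollouts) >= total_tasks_queued   # passes: "Completed 33/32 tasks..."

# get_train_data_batch later demands strict equality and raises:
assert len(completed_rollouts) == total_tasks_queued   # AssertionError
```
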
Conclusion

The root cause of the assertion is that oversized prompts/responses make vLLM return 400 repeatedly; the old failed rollouts are never cleaned up and get mixed into the next round's tasks, so _completed_rollouts ends up larger than _total_tasks_queued.
Suggestions

Do token-level trimming before the request ever reaches vLLM: set data.truncation=truncate (or truncate explicitly in the agent layer, as sketched below) so that prompt_len + response_len <= 20480. If needed, lower max_prompt_length/max_response_length or rework the hermes template to control token usage.
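
A minimal sketch of the agent-layer variant, assuming a Hugging Face tokenizer for the served model; the model id and the drop-oldest-turn policy are illustrative assumptions, not AgentLightning code:

```python
from transformers import AutoTokenizer

MAX_MODEL_LEN = 20480     # vLLM's maximum context length
MAX_RESPONSE_LEN = 8192   # data.max_response_length

# Hypothetical model id; use whatever model vLLM actually serves.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def truncate_to_budget(messages: list[dict]) -> list[dict]:
    """Drop the oldest non-system turns until the rendered prompt fits."""
    budget = MAX_MODEL_LEN - MAX_RESPONSE_LEN

    def prompt_len(msgs: list[dict]) -> int:
        # apply_chat_template renders the same chat template the server uses,
        # so the count includes the template/system-prompt overhead.
        return len(tokenizer.apply_chat_template(msgs, add_generation_prompt=True))

    while prompt_len(messages) > budget and len(messages) > 1:
        # Keep the system prompt; drop the oldest conversation turn.
        drop = 1 if messages[0].get("role") == "system" else 0
        messages.pop(drop)
    return messages
```
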
In clear_data_and_server, call a server-side cleanup endpoint (or add one) to discard the previous round's leftover _completed_rollouts; alternatively, filter the extra rollouts before get_train_data_batch, keeping only ids that appear in this round's _task_id_to_original_sample (sketched below), so the counts cannot diverge.
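
A sketch of the filtering variant; that _completed_rollouts maps rollout ids to rollout objects and that each rollout carries a task_id attribute are assumptions about daemon.py's layout, not verified source:

```python
def filter_stale_rollouts(completed_rollouts: dict, task_id_to_original_sample: dict) -> dict:
    """Drop rollouts whose task does not belong to the current round."""
    valid_task_ids = set(task_id_to_original_sample)  # this round's task ids
    return {
        rollout_id: rollout
        for rollout_id, rollout in completed_rollouts.items()
        # The task_id attribute is an assumption about the rollout object.
        if getattr(rollout, "task_id", None) in valid_task_ids
    }
```

Called just before the length check in get_train_data_batch, this would keep len(_completed_rollouts) == _total_tasks_queued even when stale results are fetched back.
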
If the run must be kept alive in the meantime, the assertion can temporarily be downgraded to a log warning that drops the extra rollouts, but the real fix is still to bound the context length so these failed tasks are never produced in the first place.

log:

```
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]   File "lib/python3.10/site-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]     return self._validate_input(request, input_ids, input_text)
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]   File "lib/python3.10/site-packages/vllm/entrypoints/openai/serving_engine.py", line 563, in _validate_input
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]     raise ValueError(
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222] ValueError: This model's maximum context length is 20480 tokens. However, you requested 49088 tokens in the messages, Please reduce the length of the messages.
(TaskRunner pid=2397748) Warning: Reward is None for rollout rollout-ff673b89-b65c-47e6-a77a-125ca8770652, will be auto-set to 0.0.
(TaskRunner pid=2397748) Warning: Triplet is None for rollout rollout-ff673b89-b65c-47e6-a77a-125ca8770652.
(TaskRunner pid=2397748) Completed 33/32 tasks...
(TaskRunner pid=2397748) INFO: 127.0.0.1:47154 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47162 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47172 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47188 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47200 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47210 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47212 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47226 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) All tasks finished.
```

```
Traceback (most recent call last):
  File "lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "lib/python3.10/site-packages/agentlightning/verl/main.py", line 4, in <module>
    main()
  File "lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 12, in main
    run_ppo(config)
  File "lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 26, in run_ppo
    ray.get(runner.run.remote(config))
  File "lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "lib/python3.10/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "lib/python3.10/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::TaskRunner.run() (pid=2397748, ip=178.28.32.225, actor_id=46872e1f62fadf329947f7ec02000000, repr=<agentlightning.verl.entrypoint.TaskRunner object at 0x7f7da5245000>)
  File "lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 152, in run
    trainer.fit()
  File "lib/python3.10/site-packages/agentlightning/verl/trainer.py", line 353, in fit
    metrics = self._train_step(batch_dict)
  File "lib/python3.10/site-packages/agentlightning/verl/trainer.py", line 95, in _train_step
    batch, agent_metrics = self.agent_mode_daemon.get_train_data_batch(
  File "lib/python3.10/site-packages/agentlightning/verl/daemon.py", line 419, in get_train_data_batch
    assert len(self._completed_rollouts) == self._total_tasks_queued
AssertionError
```

Labels: help wanted, verl