feat(RL): add RL support for verl #1298

Open
shihaobai wants to merge 224 commits into main from rl_verl_rebase_main

@shihaobai
Collaborator

No description provided.

shihaobai and others added 30 commits May 9, 2026 09:29
# Conflicts:
#	lightllm/common/basemodel/cuda_graph.py
#	lightllm/server/api_start.py
#	lightllm/server/core/objs/start_args_type.py
#	lightllm/server/httpserver/manager.py
#	lightllm/server/router/model_infer/mode_backend/generic_post_process.py
# Conflicts:
#	lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/fused_moe_weight.py
#	lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/base_impl.py
#	lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/triton_impl.py
#	lightllm/server/multimodal_params.py
- Add _assert_weight_ndim on the base mixin so 2D linear / 3D MoE merged
  weights are explicit, and 4D+ tensors fail fast instead of silently
  slicing.
- Fix a stale shape[1] assert in QuantizedColSliceMixin._slice_weight_scale
  that was inconsistent with the surrounding shape[-1] usage; only matters
  for 3D MoE quantized weights (2D was accidentally equivalent).
- Refresh the row/col convention comment to mention the MoE 3D layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
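
The ndim check and the `shape[1]` versus `shape[-1]` point in the commit above can be illustrated with a minimal sketch. It works on plain shape tuples rather than real tensors, and the function name `assert_weight_ndim`, its signature, and the `(experts, out, in)` MoE layout are assumptions made for illustration, not the mixin's actual code.

```python
from typing import Sequence


def assert_weight_ndim(shape: Sequence[int], name: str = "weight") -> None:
    """Hypothetical stand-in for the _assert_weight_ndim check in the commit.

    Accepts 2D (plain linear) and 3D (merged MoE) shapes and rejects
    everything else, so a 4D+ tensor fails fast instead of being sliced.
    """
    ndim = len(shape)
    if ndim not in (2, 3):
        raise ValueError(
            f"{name}: expected a 2D or 3D weight, got {ndim}D shape {tuple(shape)}"
        )


# Assumed layouts, for illustration only.
linear_shape = (4096, 1024)   # 2D linear weight
moe_shape = (8, 4096, 1024)   # 3D merged MoE weight with a leading expert dim

assert_weight_ndim(linear_shape)
assert_weight_ndim(moe_shape)

# For 2D shapes, index 1 and index -1 name the same axis, which is why a
# stale shape[1] assert can go unnoticed until a 3D weight arrives.
assert linear_shape[1] == linear_shape[-1]
assert moe_shape[1] != moe_shape[-1]
```

With a 3D shape, `shape[1]` points at the middle axis while `shape[-1]` still points at the last one, so only the `shape[-1]` form stays consistent across both layouts.
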
…ak httpserver debug logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rward

abort was previously fanned out from the master httpserver to the slave
httpservers over zmq, so every node's local shm got is_aborted=True before
the router's MIN-allreduce agreed. now rank 0 is the single source of truth:
the router broadcasts the aborted_req_mask with src=0, and slaves write
is_aborted back to their local shm so recycle_resource_loop still observes a
consistent state.

side effects:
- disable_abort gate is removed; is_disconnected -> abort now works in
  multinode tp dp=1 mode.
- PortLocker no longer locks nccl_port on slave ranks (only rank 0 binds the
  TCPStore listener), which fixes single-machine multi-node tp testing.

adds test/test_api/test_abort_chaos.py covering abort_all on N concurrent
streams and random per-stream disconnect chaos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
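
The rank-0-as-source-of-truth flow described in the commit above can be sketched as a small single-process simulation. It uses no distributed library; `LocalShm`, `broadcast_from_rank0`, and `apply_abort_mask` are hypothetical names introduced here, and the real router presumably uses a collective broadcast rather than a list copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LocalShm:
    """Toy stand-in for one node's shared-memory request table (hypothetical)."""
    is_aborted: Dict[int, bool] = field(default_factory=dict)


def broadcast_from_rank0(per_rank_masks: List[List[bool]]) -> List[List[bool]]:
    """Simulate broadcast(src=0): every rank ends up holding rank 0's mask."""
    src = list(per_rank_masks[0])
    return [list(src) for _ in per_rank_masks]


def apply_abort_mask(shm: LocalShm, req_ids: List[int], mask: List[bool]) -> None:
    """Write the agreed abort decision back into this rank's local table."""
    for req_id, aborted in zip(req_ids, mask):
        if aborted:
            shm.is_aborted[req_id] = True


req_ids = [101, 102, 103]
# Only rank 0 has observed the aborts; rank 1 has seen nothing locally.
masks = [[True, False, True], [False, False, False]]
shms = [LocalShm({r: False for r in req_ids}) for _ in masks]

synced = broadcast_from_rank0(masks)
for shm, mask in zip(shms, synced):
    apply_abort_mask(shm, req_ids, mask)

# Every rank now agrees with rank 0, so a resource-recycling loop reading
# the local table would see the same state on each node.
assert all(shm.is_aborted == shms[0].is_aborted for shm in shms)
```

The point of the design is that no rank marks a request aborted on its own: the local tables change only after rank 0's mask arrives, which removes the window where nodes disagree.
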
4 participants