feat(RL): add RL support for verl #1298
Open
shihaobai wants to merge 224 commits into
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Weichao Luo <luoweichao@sensetime.com>
Co-authored-by: shihaobai <1798930569@qq.com>
Co-authored-by: sufubao <sufubao@sensetime.com>
# Conflicts:
#   lightllm/common/basemodel/cuda_graph.py
#   lightllm/server/api_start.py
#   lightllm/server/core/objs/start_args_type.py
#   lightllm/server/httpserver/manager.py
#   lightllm/server/router/model_infer/mode_backend/generic_post_process.py
# Conflicts:
#   lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/fused_moe_weight.py
#   lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/base_impl.py
#   lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/triton_impl.py
#   lightllm/server/multimodal_params.py
- Add _assert_weight_ndim on the base mixin so 2D linear / 3D MoE merged weights are explicit, and 4D+ tensors fail fast instead of silently slicing.
- Fix a stale shape[1] assert in QuantizedColSliceMixin._slice_weight_scale that was inconsistent with the surrounding shape[-1] usage; only matters for 3D MoE quantized weights (2D was accidentally equivalent).
- Refresh the row/col convention comment to mention the MoE 3D layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
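The difference between `shape[1]` and `shape[-1]` described in this commit can be illustrated with plain shape tuples. This is only a rough sketch: `assert_weight_ndim` and `scale_slice_dim` are hypothetical stand-ins written for illustration, not the PR's `_assert_weight_ndim` or `_slice_weight_scale`, and the `(num_experts, out_features, in_features)` layout is an assumption.

```python
from typing import Tuple


def assert_weight_ndim(shape: Tuple[int, ...], name: str = "weight") -> None:
    """Hypothetical stand-in for the ndim check: accept 2D or 3D, reject the rest."""
    if len(shape) not in (2, 3):
        raise ValueError(
            f"{name}: expected a 2D linear or 3D MoE merged weight, got shape {shape}"
        )


def scale_slice_dim(shape: Tuple[int, ...], stale: bool = False) -> int:
    """Return the size of the axis being sliced.

    stale=False uses shape[-1]; stale=True uses the older shape[1] form.
    """
    assert_weight_ndim(shape)
    return shape[1] if stale else shape[-1]


linear_shape = (4096, 1024)   # assumed (out_features, in_features)
moe_shape = (8, 4096, 1024)   # assumed (num_experts, out_features, in_features)

# For a 2D weight, index 1 and index -1 name the same axis, so both forms agree.
same_2d = scale_slice_dim(linear_shape) == scale_slice_dim(linear_shape, stale=True)

# For a 3D merged weight, index 1 is no longer the last axis, so they diverge.
same_3d = scale_slice_dim(moe_shape) == scale_slice_dim(moe_shape, stale=True)
```

Under these assumptions `same_2d` holds and `same_3d` does not, which is why a `shape[1]` check can pass unnoticed until a 3D quantized MoE weight is sliced. A 4D shape such as `(2, 8, 4096, 1024)` raises immediately instead of being sliced along an unintended axis.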
…ak httpserver debug logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rward

Abort was previously fanned out from the master httpserver to the slave httpservers over zmq, so every node's local shm got is_aborted=True before the router's MIN-allreduce agreed. Now rank 0 is the single source of truth: the router broadcasts the aborted_req_mask with broadcast(src=0), and slaves write is_aborted back to their local shm so that recycle_resource_loop still observes a consistent state.

Side effects:
- The disable_abort gate is removed; is_disconnected -> abort now works in multinode tp dp=1 mode.
- PortLocker no longer locks nccl_port on slave ranks (only rank 0 binds the TCPStore listener), which fixes single-machine multi-node tp testing.

Adds test/test_api/test_abort_chaos.py, covering abort_all on N concurrent streams and random per-stream disconnect chaos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
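The rank-0-as-source-of-truth flow in this commit can be sketched as a pure-Python simulation. It makes no real `torch.distributed` or shared-memory calls. `broadcast_from_rank0` and `write_back` are hypothetical helpers, and modelling the local shm as a dict keyed by request id is an assumption made for illustration.

```python
from typing import Dict, List


def broadcast_from_rank0(per_rank_masks: List[List[bool]]) -> List[List[bool]]:
    """Simulate broadcast(src=0): every rank receives a copy of rank 0's mask."""
    src = list(per_rank_masks[0])
    return [list(src) for _ in per_rank_masks]


def write_back(local_shm: Dict[int, bool], req_ids: List[int], mask: List[bool]) -> None:
    """Mark requests as aborted in a rank's local state (a stand-in for shm)."""
    for req_id, aborted in zip(req_ids, mask):
        if aborted:
            local_shm[req_id] = True


req_ids = [101, 102, 103]
per_rank_masks = [
    [False, True, False],   # rank 0 has seen the abort for request 102
    [False, False, False],  # rank 1 has not seen any abort yet
]

# After the broadcast, all ranks hold rank 0's view of which requests are aborted.
agreed = broadcast_from_rank0(per_rank_masks)

# Each rank then records that view locally, so later readers see the same state.
shm_by_rank = [{rid: False for rid in req_ids} for _ in per_rank_masks]
for shm, mask in zip(shm_by_rank, agreed):
    write_back(shm, req_ids, mask)
```

In this sketch no rank marks a request as aborted before rank 0 has decided, which is the property the commit message attributes to the broadcast-based design.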
No description provided.