32 changes: 31 additions & 1 deletion core/helm-charts/vllm/xeon-values.yaml
@@ -243,6 +243,36 @@ modelConfigs:
    tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
    pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"

  "Qwen/Qwen2.5-VL-7B-Instruct":
Collaborator commented:
Thanks for adding support for this VL model. For better performance on Xeon, include these additional environment variables and extra command arguments. Also note that tensor parallelism is calculated dynamically based on the system configuration where the models are deployed.

configMapValues:
  VLLM_CPU_KVCACHE_SPACE: "40"
  VLLM_RPC_TIMEOUT: "100000"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
  VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
  VLLM_CPU_SGL_KERNEL: "1"
  HF_HUB_DISABLE_XET: "1"
extraCmdArgs:
  [
    "--block-size",
    "128",
    "--dtype",
    "bfloat16",
    "--distributed_executor_backend",
    "mp",
    "--enable_chunked_prefill",
    "--enforce-eager",
    "--max-model-len",
    "33024",
    "--max-num-batched-tokens",
    "2048",
    "--max-num-seqs",
    "256",
  ]
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"

Author replied:

Thanks for the suggestions! I've updated xeon-values.yaml to include the additional configMap values and extra command arguments as suggested. Please let me know if anything else needs adjustment.

    configMapValues:
      VLLM_SKIP_WARMUP: true
      VLLM_CPU_KVCACHE_SPACE: "40"
      VLLM_RPC_TIMEOUT: "100000"
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
      VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
      VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
      VLLM_CPU_SGL_KERNEL: "1"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--block-size",
        "128",
        "--dtype",
        "bfloat16",
        "--distributed_executor_backend",
        "mp",
        "--enable_chunked_prefill",
        "--enforce-eager",
        "--max-model-len",
        "33024",
        "--max-num-batched-tokens",
        "2048",
        "--max-num-seqs",
        "256",
      ]
    tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
    pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
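As a hedged sketch of how the templated parallelism fields above resolve: `"{{ .Values.tensor_parallel_size }}"` and `"{{ .Values.pipeline_parallel_size }}"` are Helm template expressions filled in from chart values at render time. Assuming the chart exposes these as top-level values (the override file name and numbers below are illustrative assumptions, not part of this PR), an install-time override could look like:

# Hypothetical override file, e.g. my-overrides.yaml (values are assumptions):
#   helm upgrade vllm ./core/helm-charts/vllm -f xeon-values.yaml -f my-overrides.yaml
tensor_parallel_size: "2"     # "{{ .Values.tensor_parallel_size }}" renders to "2"
pipeline_parallel_size: "1"   # "{{ .Values.pipeline_parallel_size }}" renders to "1"

Per the reviewer's note, these values may also be computed dynamically from the deployment system's configuration rather than set by hand.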

defaultModelConfigs:
  configMapValues:
    VLLM_CPU_KVCACHE_SPACE: "40"
@@ -270,4 +300,4 @@ defaultModelConfigs:
      "256",
    ]
  tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
  pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"