From 678e7c68345ff8b40c02803f01f4a0500788ac70 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Tue, 3 Mar 2026 13:59:02 +0100
Subject: [PATCH 01/22] save

---
 demos/continuous_batching/agentic_ai/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 4963110568..514f673962 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -570,6 +570,12 @@ pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/ma
 ```
 Make sure nodejs and npx are installed. On Ubuntu it would require `sudo apt install nodejs npm`. On Windows, visit https://nodejs.org/en/download. It is needed for the `file system` MCP server.
 
+For Windows applications it may be required to set an environment variable to enforce UTF-8 encoding in Python:
+
+```bat
+set PYTHONUTF8=1
+```
+
 Run the agentic application:

From ace7b26f0312b696c6d35671355692673a5ed83d Mon Sep 17 00:00:00 2001
From: Pawel
Date: Wed, 4 Mar 2026 13:03:28 +0100
Subject: [PATCH 02/22] draft

---
 .../continuous_batching/agentic_ai/README.md | 412 +++++----------
 1 file changed, 57 insertions(+), 355 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 514f673962..dda38439f4 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -12,165 +12,6 @@ The tools can also be used for automation purposes based on input in text format
 
 > **Note:** On Windows, make sure to use the weekly or 2025.4 release packages for proper functionality.
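The `PYTHONUTF8` switch recommended in the first patch can be checked from Python itself; a small cross-platform sketch that re-runs the interpreter with the variable set and reads `sys.flags.utf8_mode`:

```python
import os
import subprocess
import sys

# Re-run Python with PYTHONUTF8=1 and report whether UTF-8 mode is active.
env = dict(os.environ, PYTHONUTF8="1")
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
    env=env,
    capture_output=True,
    text=True,
)
print(out.stdout.strip())  # prints 1 when UTF-8 mode is enforced
```

The same check run without the variable shows the platform default, which is why the explicit setting matters for Windows consoles.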
-## Export LLM model -Currently supported models: -- Qwen/Qwen3-8B -- Qwen/Qwen3-4B -- meta-llama/Llama-3.1-8B-Instruct -- meta-llama/Llama-3.2-3B-Instruct -- NousResearch/Hermes-3-Llama-3.1-8B -- mistralai/Mistral-7B-Instruct-v0.3 -- microsoft/Phi-4-mini-instruct -- Qwen/Qwen3-Coder-30B-A3B-Instruct -- openai/gpt-oss-20b - - -### Export using python script - -Use those steps to convert the model from HugginFace Hub to OpenVINO format and export it to a local storage. - -```console -# Download export script, install its dependencies and create directory for the models -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -mkdir models -``` -Run `export_model.py` script to download and quantize the model: - -> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" or "https://www.modelscope.cn/models" before running the export script to connect to the HF Hub. 
- -::::{tab-set} -:::{tab-item} Qwen3-8B -:sync: Qwen3-8B -```console -python export_model.py text_generation --source_model Qwen/Qwen3-8B --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 -``` -::: -:::{tab-item} Qwen3-4B -:sync: Qwen3-4B -```console -python export_model.py text_generation --source_model Qwen/Qwen3-4B --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```console -python export_model.py text_generation --source_model meta-llama/Llama-3.1-8B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 -curl -L -o models/meta-llama/Llama-3.1-8B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.1_json.jinja -``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```console -python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 -curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja -``` -::: -:::{tab-item} Hermes-3-Llama-3.1-8B -:sync: Hermes-3-Llama-3.1-8B -```console -python export_model.py text_generation --source_model NousResearch/Hermes-3-Llama-3.1-8B --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 -curl -L -o models/NousResearch/Hermes-3-Llama-3.1-8B/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_hermes.jinja -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: 
Mistral-7B-Instruct-v0.3 -```console -python export_model.py text_generation --source_model mistralai/Mistral-7B-Instruct-v0.3 --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser mistral --extra_quantization_params "--task text-generation-with-past" -curl -L -o models/mistralai/Mistral-7B-Instruct-v0.3/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.10.1.1/examples/tool_chat_template_mistral_parallel.jinja -``` -::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct -```console -python export_model.py text_generation --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --weight-format int4 --config_file_path models/config.json --model_repository_path models --tool_parser qwen3coder -curl -L -o models/Qwen/Qwen3-Coder-30B-A3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_qwen3coder_instruct.jinja -``` -::: -:::{tab-item} gpt-oss-20b -:sync: gpt-oss-20b -```console -python export_model.py text_generation --source_model openai/gpt-oss-20b --weight-format int4 --config_file_path models/config.json --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss -curl -L -o models/openai/gpt-oss-20b/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_gpt_oss.jinja -``` -> **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. - -::: -:::{tab-item} Phi-4-mini-instruct -:sync: microsoft/Phi-4-mini-instruct -Note: This model requires a fix in optimum-intel which is currently on a fork. 
-```console -pip3 install transformers==4.53.3 --force-reinstall -pip3 install "optimum-intel[openvino]"@git+https://github.com/helena-intel/optimum-intel/@ea/lonrope_exp -python export_model.py text_generation --source_model microsoft/Phi-4-mini-instruct --weight-format int4 --config_file_path models/config.json --model_repository_path models --tool_parser phi4 --max_num_batched_tokens 99999 -curl -L -o models/microsoft/Phi-4-mini-instruct/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_phi4_mini.jinja -``` -::: -:::: - -> **Note:** To use these models on NPU, set `--weight-format` to either **int4** or **nf4**. When specifying `--extra_quantization_params`, ensure that `ratio` is set to **1.0** and `group_size` is set to **-1** or **128**. For more details, see [OpenVINO GenAI on NPU](https://docs.openvino.ai/nightly/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html). 
- -### Direct pulling of pre-configured HuggingFace models from docker containers - -This procedure can be used to pull preconfigured models from OpenVINO organization in HuggingFace Hub -::::{tab-set} -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --pull --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation --tool_parser hermes3 -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --pull --model_repository_path /models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --task text_generation --tool_parser mistral -curl -L -o models/OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.10.1.1/examples/tool_chat_template_mistral_parallel.jinja -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --pull --model_repository_path /models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --task text_generation --tool_parser phi4 -curl -L -o models/OpenVINO/Phi-4-mini-instruct-int4-ov/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_phi4_mini.jinja -``` -::: -:::: - - -### Direct pulling of pre-configured HuggingFace models on Windows - -Assuming you have unpacked model server package with python enabled version, make sure to run `setupvars` script -as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. 
-::::{tab-set}
-:::{tab-item} Qwen3-8B-int4-ov
-:sync: Qwen3-8B-int4-ov
-```bat
-ovms.exe --pull --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation --tool_parser hermes3 --enable_prefix_caching true
-```
-:::
-:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov
-:sync: Mistral-7B-Instruct-v0.3-int4-ov
-```bat
-ovms.exe --pull --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --task text_generation --tool_parser mistral --enable_prefix_caching true
-curl -L -o models\OpenVINO\Mistral-7B-Instruct-v0.3-int4-ov\chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.10.1.1/examples/tool_chat_template_mistral_parallel.jinja
-```
-:::
-:::{tab-item} Phi-4-mini-instruct-int4-ov
-:sync: Phi-4-mini-instruct-int4-ov
-```bat
-ovms.exe --pull --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --task text_generation --tool_parser phi4 --enable_prefix_caching true
-curl -L -o models\OpenVINO\Phi-4-mini-instruct-int4-ov\chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_phi4_mini.jinja
-```
-:::
-::::
-You can use similar commands for different models and precision. Change the source_model and other configuration parameters.
-> **Note:** Some models give more reliable responses with tuned chat template.
-> **Note:** Currently tool parsers are supported for formats compatible with Phi4, Llama3, Mistral, Devstral, Hermes3 or GPT-OSS.
-
-
-
 ## Start OVMS
 
 This deployment procedure assumes the model was pulled or exported using the procedure above. The exceptions are models from the OpenVINO organization that support tools correctly with the default template, like "OpenVINO/Qwen3-8B-int4-ov" - these can be deployed with a single command that pulls the model and starts the server.
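Once the server is running, agentic clients reach it through the OpenAI-compatible chat completions API. A minimal client-side request sketch follows; the model name, the `list_directory` tool schema, and the `/v3/chat/completions` path are illustrative assumptions, so adjust them to the model and release you actually deployed:

```python
import json

# Illustrative OpenAI-style tool-calling payload for a local OVMS deployment
# on --rest_port 8000. "list_directory" is a made-up example tool, not one of
# the demo's MCP tools.
url = "http://localhost:8000/v3/chat/completions"  # assumed endpoint path
payload = {
    "model": "OpenVINO/Qwen3-8B-int4-ov",
    "messages": [{"role": "user", "content": "List the files in the models directory"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "list_directory",
                "description": "List files in a directory",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "stream": False,
}
body = json.dumps(payload)  # ready to POST with any HTTP client
print(body[:40])
```

A payload of this shape can typically also be sent through the OpenAI Python client pointed at the local base URL, which is how the agentic frameworks in this demo consume the server.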
@@ -184,68 +25,44 @@ as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), :::{tab-item} Qwen3-8B :sync: Qwen3-8B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-8B --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-4B --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.1-8B-Instruct --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.2-3B-Instruct --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} 
Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bat -ovms.exe --rest_port 8000 --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bat -ovms.exe --rest_port 8000 --source_model microsoft/Phi-4-mini-instruct --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 --enable_prefix_caching true -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU 
--task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bat set MOE_USE_MICRO_GEMM_PREFILL=0 -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b ```bat -ovms.exe --rest_port 8000 --source_model openai/gpt-oss-20b --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU +ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. 
@@ -258,46 +75,27 @@ ovms.exe --rest_port 8000 --source_model openai/gpt-oss-20b --model_repository_p :::{tab-item} Qwen3-8B :sync: Qwen3-8B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-8B --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-4B --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.1-8B-Instruct --model_repository_path models --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 +ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.2-3B-Instruct --model_repository_path models --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 +ovms.exe --rest_port 8000 --source_model llmware/llama-3.2-3b-instruct-npu-ov --model_repository_path models --tool_parser llama3 --target_device NPU 
--task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
 ```
 :::
 :::{tab-item} Mistral-7B-Instruct-v0.3
 :sync: Mistral-7B-Instruct-v0.3
 ```bat
-ovms.exe --rest_port 8000 --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
-```
-:::
-:::{tab-item} Qwen3-4B-int4-ov
-:sync: Qwen3-4B-int4-ov
-```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
-```
-:::
-:::{tab-item} Mistral-7B-Instruct-v0.3-cw-int4-ov
-:sync: Mistral-7B-Instruct-v0.3-cw-int4-ov
-```bat
 ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --model_repository_path models --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
 ```
 :::
 ::::
 
 > **Note:** Setting the `--max_prompt_len` parameter too high may lead to performance degradation. It is recommended to use the smallest value that meets your requirements.
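A consequence of a small `--max_prompt_len` (4000 in the commands above) is that the client must keep the rendered conversation within that budget. A rough, hypothetical client-side helper is sketched below; it uses a crude characters-per-token heuristic rather than the model's real tokenizer, and the function name is an illustration, not part of the demo:

```python
# Hypothetical sketch: drop the oldest non-system turns so the conversation
# stays under a token budget. Uses a crude chars-per-token heuristic.
def trim_history(messages, max_tokens=4000, chars_per_token=4):
    def cost(msgs):
        return sum(len(m["content"]) for m in msgs) // chars_per_token

    msgs = list(messages)
    while len(msgs) > 2 and cost(msgs) > max_tokens:
        # keep the system prompt (index 0) and the most recent turns
        del msgs[1]
    return msgs


history = [{"role": "system", "content": "You are a helpful agent."}]
history += [
    {"role": "user", "content": "x" * 9000},
    {"role": "user", "content": "latest question"},
]
print([m["role"] for m in trim_history(history, max_tokens=1000)])  # ['system', 'user']
```

In practice a real tokenizer gives a tighter bound, but the principle is the same: the smaller `--max_prompt_len` is, the more aggressively the agent must prune old turns and tool outputs.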
@@ -308,77 +106,42 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4- :sync: Qwen3-8B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-8B --tool_parser hermes3 --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-4B --tool_parser hermes3 --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.1-8B-Instruct --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.2-3B-Instruct --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true -``` -::: -:::{tab-item} Hermes-3-Llama-3.1-8B -:sync: Hermes-3-Llama-3.1-8B -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v 
$(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model NousResearch/Hermes-3-Llama-3.1-8B --tool_parser hermes3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true +--rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model mistralai/Mistral-7B-Instruct-v0.3 --tool_parser mistral --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model microsoft/Phi-4-mini-instruct --tool_parser phi4 --task text_generation --enable_prefix_caching true --max_num_batched_tokens 99999 --enable_tool_guided_generation true -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models 
openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation --enable_prefix_caching true --max_num_batched_tokens 99999 --enable_tool_guided_generation true ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true +--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true ``` ::: :::: @@ -395,84 +158,49 @@ It can be applied using the commands below: :sync: Qwen3-8B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-8B --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov 
--tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-4B --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.1-8B-Instruct --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.2-3B-Instruct --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true -``` -::: -:::{tab-item} Hermes-3-Llama-3.1-8B -:sync: Hermes-3-Llama-3.1-8B -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 
--model_repository_path models --source_model NousResearch/Hermes-3-Llama-3.1-8B --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model mistralai/Mistral-7B-Instruct-v0.3 --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model microsoft/Phi-4-mini-instruct --tool_parser phi4 --target_device GPU --task text_generation --max_num_batched_tokens 99999 --enable_prefix_caching true -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task 
text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --target_device GPU --task text_generation --max_num_batched_tokens 99999 --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task 
text_generation --enable_tool_guided_generation true --enable_prefix_caching true ``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --source_model openai/gpt-oss-20b --model_repository_path models \ +--rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models \ --tool_parser gptoss --reasoning_parser gptoss --target_device GPU --task text_generation --enable_prefix_caching true ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. @@ -491,52 +219,30 @@ It can be applied using the commands below: :sync: Qwen3-8B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-8B --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-4B --tool_parser hermes3 --target_device NPU --task 
text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.1-8B-Instruct --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 +--rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.2-3B-Instruct --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 +--rest_port 8000 --model_repository_path models --source_model llmware/llama-3.2-3b-instruct-npu-ov --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model mistralai/Mistral-7B-Instruct-v0.3 --tool_parser mistral --target_device NPU 
--task text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: -:::{tab-item} Qwen3-8B-int4-cw-ov -:sync: Qwen3-8B-int4-cw-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-cw-ov -:sync: Mistral-7B-Instruct-v0.3-int4-cw-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` ::: -:::: ### Deploy all models in a single container Those steps deploy all the models exported earlier. The python script added the models to `models/config.json` so just the remaining models pulled directly from HuggingFace Hub are to be added: @@ -583,73 +289,49 @@ Run the agentic application: :::{tab-item} Qwen3-8B :sync: Qwen3-8B ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen/Qwen3-8B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking ``` ```bash -python openai_agent.py --query "List the files in folder /root" --model Qwen/Qwen3-8B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all +python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen/Qwen3-4B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream ``` ```bash -python openai_agent.py --query "List the files in folder /root" --model Qwen/Qwen3-4B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -python openai_agent.py --query "List the files in folder /root" --model meta-llama/Llama-3.1-8B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all +python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash -python openai_agent.py --query "List the files in folder /root" --model mistralai/Mistral-7B-Instruct-v0.3 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all 
--tool_choice required +python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash -python openai_agent.py --query "List the files in folder /root" --model meta-llama/Llama-3.2-3B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all +python openai_agent.py --query "List the files in folder /root" --model srang992/Llama-3.2-3B-Instruct-ov-INT4 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model microsoft/Phi-4-mini-instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov -:sync: OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov -```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen3/Qwen3-Coder-30B-A3B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b ```console -python openai_agent.py --query "What is the current weather in Tokyo?" --model openai/gpt-oss-20b --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::: @@ -663,7 +345,6 @@ You can try also similar implementation based on llama_index library working the pip install llama-index-llms-openai-like==0.5.3 llama-index-core==0.14.5 llama-index-tools-mcp==0.4.2 curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/llama_index_agent.py -o llama_index_agent.py python llama_index_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking - ``` @@ -685,12 +366,12 @@ mv 1184.txt.utf-8 pg1184.txt docker run -d --name ovms --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --task text_generation --target_device GPU -python benchmark_serving_multi_turn.py -m Qwen/Qwen3-8B --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50 +python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50 # Testing high concurrency, for example on Xeon CPU with constrained resources (in case of memory constrains, reduce cache_size) docker run -d --name ovms --cpuset-cpus 0-15 --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --cache_size 20 --task text_generation -python benchmark_serving_multi_turn.py -m Qwen/Qwen3-8B --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 +python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 ``` Below is an example of the output captured on iGPU: ``` @@ -741,3 +422,24 @@ Here is example of the response from the OpenVINO/Qwen3-8B-int4-ov model: ``` Models can be also compared using the [leaderboard 
reports](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard).
+
+### Export using python script
+
+Use these steps to convert a model from the Hugging Face Hub to OpenVINO format and export it to local storage.
+
+```console
+# Download export script, install its dependencies and create directory for the models
+curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
+pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
+mkdir models
+```
+Run the `export_model.py` script to download and quantize the model:
+
+> **Note:** Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" or "https://www.modelscope.cn/models" before running the export script to connect to the HF Hub.
+
+```console
+python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3
+curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja
+```
+
+> **Note:** To use these models on NPU, set `--weight-format` to either **int4** or **nf4**. When specifying `--extra_quantization_params`, ensure that `ratio` is set to **1.0** and `group_size` is set to **-1** or **128**. For more details, see [OpenVINO GenAI on NPU](https://docs.openvino.ai/nightly/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html).
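The `--tool_parser` option selects how the server extracts tool calls from the raw model output. As a rough illustration of what such a parser does (a simplified sketch only, not the server's actual implementation; the `raw_output` sample and the `parse_tool_calls` helper are made up for this example), a Hermes-style `<tool_call>` block can be located and decoded like this:

```python
import json
import re

# Hypothetical model output imitating the Hermes-style tool call format;
# real model output varies and may contain several calls.
raw_output = (
    'Let me check the weather.\n'
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n'
    '</tool_call>'
)

def parse_tool_calls(text):
    """Extract (name, arguments) pairs from <tool_call> JSON payloads."""
    calls = []
    for match in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        payload = json.loads(match)
        calls.append((payload["name"], payload["arguments"]))
    return calls

print(parse_tool_calls(raw_output))
# [('get_weather', {'city': 'Tokyo'})]
```

The server's built-in parsers additionally have to cope with streaming, multiple calls, and malformed output, which is why the parser chosen must match the model's chat template.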
\ No newline at end of file From 35177a3297b4ab23b1e4022b2c95a5a60a8beee9 Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 4 Mar 2026 13:08:11 +0100 Subject: [PATCH 03/22] save --- demos/continuous_batching/agentic_ai/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index dda38439f4..fab1429200 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -288,43 +288,43 @@ Run the agentic application: ::::{tab-set} :::{tab-item} Qwen3-8B :sync: Qwen3-8B -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking ``` -```bash +```text python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream ``` -```bash +```text python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 -```bash +```text python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct -```bash +```text python openai_agent.py --query "List the files in folder /root" --model srang992/Llama-3.2-3B-Instruct-ov-INT4 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: From 7a87239a5948500c84e5709ea5cca68e7fdd30ee Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 4 Mar 2026 13:32:31 +0100 Subject: [PATCH 04/22] save --- demos/continuous_batching/agentic_ai/README.md | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index fab1429200..10c32a1bb3 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -244,16 +244,6 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/mode ``` ::: -### Deploy all models in a single container -Those steps deploy all the models exported earlier. The python script added the models to `models/config.json` so just the remaining models pulled directly from HuggingFace Hub are to be added: -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --add_to_config --model_name OpenVINO/Qwen3-8B-int4-ov --model_path OpenVINO/Qwen3-8B-int4-ov --config_path /models/config.json -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --add_to_config --model_name OpenVINO/Phi-4-mini-instruct-int4-ov --model_path OpenVINO/Phi-4-mini-instruct-int4-ov --config_path /models/config.json -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --add_to_config --model_name OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_path OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov--config_path /models/config.json -docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models:ro openvino/model_server:weekly --rest_port 8000 --config_path /models/config.json -``` - - ## Start MCP server with SSE interface ### Linux From d19be9af43aea543dbe75b38cf1bf1d53d5762bb 
Mon Sep 17 00:00:00 2001
From: Pawel
Date: Fri, 6 Mar 2026 07:49:59 +0100
Subject: [PATCH 05/22] save

---
 .../continuous_batching/agentic_ai/README.md  | 258 +++++-------------
 1 file changed, 61 insertions(+), 197 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 10c32a1bb3..62637275d1 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -12,6 +12,34 @@ The tools can also be used for automation purposes based on input in text format
 
 > **Note:** On Windows, make sure to use the weekly or 2025.4 release packages for proper functionality.
 
+## Start MCP server with SSE interface
+
+### Linux
+```bash
+git clone https://github.com/isdaniel/mcp_weather_server
+cd mcp_weather_server && git checkout v0.5.0
+docker build -t mcp-weather-server:sse .
+docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m mcp_weather_server --mode sse
+```
+
+> **Note:** On Windows, the MCP server is demonstrated as an instance with a stdio interface inside the agent application.
+
+## Start the agent
+
+Install the application requirements:
+
+```console
+curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -o openai_agent.py
+pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/requirements.txt
+```
+Make sure Node.js and npx are installed; they are needed for the `file system` MCP server. On Ubuntu, install them with `sudo apt install nodejs npm`. On Windows, visit https://nodejs.org/en/download.
+
+On Windows, it may be necessary to set an environment variable to enforce UTF-8 encoding in Python:
+
+```bat
+set PYTHONUTF8=1
+```
+
 ## Start OVMS
 
 This deployment procedure assumes the model was pulled or exported using the procedure above. 
The exceptions are models from the OpenVINO organization that support tools correctly with the default template, like "OpenVINO/Qwen3-8B-int4-ov" - they can be deployed in a single command that pulls the model and starts the server.
@@ -84,18 +112,6 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_re
 ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
 ```
 :::
-:::{tab-item} Llama-3.2-3B-Instruct
-:sync: Llama-3.2-3B-Instruct
-```bat
-ovms.exe --rest_port 8000 --source_model llmware/llama-3.2-3b-instruct-npu-ov --model_repository_path models --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
-```
-:::
-:::{tab-item} Mistral-7B-Instruct-v0.3
-:sync: Mistral-7B-Instruct-v0.3
-```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --model_repository_path models --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
-```
-:::
 
 > **Note:** Setting the `--max_prompt_len` parameter too high may lead to performance degradation. It is recommended to use the smallest value that meets your requirements.
 
@@ -108,6 +124,14 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-
 docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
 --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true
 ```
+
+```bash
+python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is -1.5°C. Wind is blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover, and visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B @@ -115,26 +139,13 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 + ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation --enable_prefix_caching true +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` -::: -:::{tab-item} Phi-4-mini-instruct -:sync: Phi-4-mini-instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation --enable_prefix_caching true --max_num_batched_tokens 99999 --enable_tool_guided_generation true + +```text +The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. Winds are coming from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover. Visibility is 24.1 km. ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct @@ -143,6 +154,14 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/mode docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is clear sky with a temperature of 5.5°C (feels like 2.8°C). The relative humidity is at 64%, and the dew point is -0.8°C. Wind is blowing from the NNE at 3.2 km/h with gusts up to 10.8 km/h. The atmospheric pressure is 1023.4 hPa with 0% cloud cover. Visibility is 24.1 km. 
+``` ::: :::: @@ -168,27 +187,6 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct -:sync: Phi-4-mini-instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --target_device GPU --task text_generation --max_num_batched_tokens 99999 --enable_prefix_caching true -``` -::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash @@ -221,6 +219,14 @@ It can be applied using the commands below: docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 
-v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-cw-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. The wind is blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover, and the visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B @@ -228,100 +234,13 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model llmware/llama-3.2-3b-instruct-npu-ov --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 -``` -::: 
-:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: - -## Start MCP server with SSE interface -### Linux ```bash -git clone https://github.com/isdaniel/mcp_weather_server -cd mcp_weather_server && git checkout v0.5.0 -docker build -t mcp-weather-server:sse . -docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m mcp_weather_server --mode sse -``` - -> **Note:** On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application - -## Start the agent - -Install the application requirements - -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -o openai_agent.py -pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/requirements.txt -``` -Make sure nodejs and npx are installed. On ubuntu it would require `sudo apt install nodejs npm`. On windows, visit https://nodejs.org/en/download. It is needed for the `file system` MCP server. - -For windows applications it may be required to set environmental variable to enforce utf-8 encodeing in python: - -```bat -set PYTHONUTF8=1 +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model FluidInference/qwen3-4b-int4-ov-npu --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream ``` -Run the agentic application: - - -::::{tab-set} -:::{tab-item} Qwen3-8B -:sync: Qwen3-8B ```text -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking -``` -```text -python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Qwen3-4B -:sync: Qwen3-4B -```text -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream -``` -```text -python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```text -python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required -``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```text -python openai_agent.py --query "List the files in folder /root" --model srang992/Llama-3.2-3B-Instruct-ov-INT4 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Phi-4-mini-instruct -:sync: Phi-4-mini-instruct -```text -python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct -```text -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} gpt-oss-20b -:sync: gpt-oss-20b -```console -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. There is a wind blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover. The visibility is 24.1 km. ``` ::: :::: @@ -337,61 +256,6 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/c python llama_index_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking ``` - -## Testing efficiency in agentic use case - -Using LLM models with AI agents has a unique load characteristics with multi-turn communication and resending bit parts of the prompt as the previous conversation. -To simulate such type of load, we should use a dedicated tool [multi_turn benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn). 
-```bash -git clone -b v0.10.2 https://github.com/vllm-project/vllm -cd vllm/benchmarks/multi_turn -pip install -r requirements.txt -sed -i -e 's/if not os.path.exists(args.model)/if 1 == 0/g' benchmark_serving_multi_turn.py - -#Download the following text file (used for generation of synthetic conversations) -wget https://www.gutenberg.org/ebooks/1184.txt.utf-8 -mv 1184.txt.utf-8 pg1184.txt - -# Testing single client scenario, for example with GPU execution -docker run -d --name ovms --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --task text_generation --target_device GPU - -python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50 - -# Testing high concurrency, for example on Xeon CPU with constrained resources (in case of memory constrains, reduce cache_size) -docker run -d --name ovms --cpuset-cpus 0-15 --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --cache_size 20 --task text_generation - -python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 -``` -Below is an example of the output captured on iGPU: -``` -Parameters: -model=OpenVINO/Qwen3-8B-int4-ov -num_clients=1 -num_conversations=100 -active_conversations=None -seed=0 -Conversations Generation Parameters: -text_files=pg1184.txt -input_num_turns=UniformDistribution[12, 18] -input_common_prefix_num_tokens=Constant[500] 
-input_prefix_num_tokens=LognormalDistribution[6, 4]
-input_num_tokens=UniformDistribution[120, 160]
-output_num_tokens=UniformDistribution[80, 120]
-----------------------------------------------------------------------------------------------------
-Statistics summary:
-runtime_sec = 307.569
-requests_per_sec = 0.163
-----------------------------------------------------------------------------------------------------
- count mean std min 25% 50% 75% 90% max
-ttft_ms 50.0 1052.97 987.30 200.61 595.29 852.08 1038.50 1193.38 4265.27
-tpot_ms 50.0 51.37 2.37 47.03 49.67 51.45 53.16 54.42 55.23
-latency_ms 50.0 6128.26 1093.40 4603.86 5330.43 5995.30 6485.20 7333.73 9505.51
-input_num_turns 50.0 7.64 4.72 1.00 3.00 7.00 11.00 15.00 17.00
-input_num_tokens 50.0 2298.92 973.02 520.00 1556.50 2367.00 3100.75 3477.70 3867.00
-```
-
-
 ## Testing accuracy

 Testing model accuracy is critical for successful adoption in AI applications. The recommended methodology is to use the BFCL tool as described in the [testing guide](../accuracy/README.md#running-the-tests-for-agentic-models-with-function-calls).
@@ -417,7 +281,7 @@ Models can be also compared using the [leaderboard reports](https://gorilla.cs.b

 Use these steps to convert a model from the Hugging Face Hub to OpenVINO format and export it to local storage.

-```console
+```text
 # Download export script, install its dependencies and create directory for the models
 curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
 pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
 mkdir models
@@ -427,7 +291,7 @@ Run `export_model.py` script to download and quantize the model:

 > **Note:** Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" or "https://www.modelscope.cn/models" before running the export script to connect to the HF Hub.
-```console +```text python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja ``` From 991b4865289195446d292180adef08f060701ed6 Mon Sep 17 00:00:00 2001 From: Pawel Date: Fri, 6 Mar 2026 12:12:45 +0100 Subject: [PATCH 06/22] save --- .../continuous_batching/agentic_ai/README.md | 101 ++++++++++++++++-- 1 file changed, 94 insertions(+), 7 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 62637275d1..21156fd078 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -68,12 +68,6 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repos ovms.exe --rest_port 8000 --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true ``` ::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bat @@ -105,13 +99,30 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_re ```bat ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true 
--cache_dir .cache --max_prompt_len 4000
 ```
+
+```bat
+python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-cw-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
+```
+
+```text
+The current weather in Tokyo is mainly clear with a temperature of 11.5°C. The relative humidity is at 82%, and the dew point is 8.5°C. The wind is blowing from the S at 6.8 km/h, with gusts up to 13.7 km/h. The atmospheric pressure is 1017.1 hPa, and there is 21% cloud cover. Visibility is 24.1 km.
+```
 :::
 :::{tab-item} Qwen3-4B
 :sync: Qwen3-4B
 ```bat
 ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
 ```
+
+```bat
+python openai_agent.py --query "What is the current weather in Tokyo?" --model FluidInference/qwen3-4b-int4-ov-npu --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
+```
+
+```text
+The current weather in Tokyo is mainly clear, with a temperature of 11.5°C. The relative humidity is at 82%, and the dew point is at 8.5°C. There is a wind blowing from the south at 6.8 km/h, with gusts up to 13.7 km/h. The atmospheric pressure is 1017.1 hPa, and there is 21% cloud cover. The visibility is 24.1 km.
+```
 :::
+::::

 > **Note:** Setting the `--max_prompt_len` parameter too high may lead to performance degradation. It is recommended to use the smallest value that meets your requirements.

@@ -148,6 +159,21 @@ python openai_agent.py --query "What is the current weather in Tokyo?" --model O
 The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. Winds are coming from the NNW at 6.8 km/h with gusts up to 21.2 km/h.
The atmospheric pressure is 1021.5 hPa with 0% cloud cover. Visibility is 24.1 km. ``` ::: +:::{tab-item} Phi-4-mini-instruct +:sync: Phi-4-mini-instruct +```bash +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true +``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required +``` + +```text +The current weather in Tokyo is mostly clear with a temperature of 12.4°C. The relative humidity is at 68%, and the dew point is at 6.7°C. Winds are coming from the SSE at a speed of 5.3 km/h, with gusts reaching up to 25.2 km/h. The atmospheric pressure is 1017.9 hPa, and there is a 23% cloud cover. Visibility is good at 24.1 km. +``` +::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash @@ -160,7 +186,18 @@ python openai_agent.py --query "What is the current weather in Tokyo?" --model O ``` ```text -The current weather in Tokyo is clear sky with a temperature of 5.5°C (feels like 2.8°C). The relative humidity is at 64%, and the dew point is -0.8°C. Wind is blowing from the NNE at 3.2 km/h with gusts up to 10.8 km/h. The atmospheric pressure is 1023.4 hPa with 0% cloud cover. Visibility is 24.1 km. +The current weather in Tokyo is as follows: +- **Condition**: Mainly clear +- **Temperature**: 11.8°C +- **Relative Humidity**: 78% +- **Dew Point**: 8.1°C +- **Wind**: Blowing from the SSE at 6.4 km/h with gusts up to 9.7 km/h +- **Atmospheric Pressure**: 1017.5 hPa +- **Cloud Cover**: 22% +- **Visibility**: 24.1 km +- **UV Index**: Not specified + +It's a relatively pleasant day with clear skies and mild temperatures. 
``` ::: :::: @@ -179,6 +216,14 @@ It can be applied using the commands below: docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is mainly clear with a temperature of 11.7°C. The relative humidity is at 74%, and the dew point is 7.2°C. The wind is blowing from the southeast at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, and there is 44% cloud cover. Visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B @@ -186,6 +231,14 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is mainly clear. The temperature is 11.7°C, with a relative humidity of 74% and a dew point of 7.2°C. The wind is coming from the SSE at 4.2 km/h, with gusts up to 22.7 km/h. 
The atmospheric pressure is 1018.0 hPa, with 44% cloud cover. Visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct @@ -193,6 +246,22 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is as follows: +- **Condition**: Mainly clear +- **Temperature**: 11.7°C +- **Relative Humidity**: 74% +- **Dew Point**: 7.2°C +- **Wind**: SSE at 4.2 km/h, with gusts up to 22.7 km/h +- **Atmospheric Pressure**: 1018.0 hPa +- **Cloud Cover**: 44% +- **Visibility**: 24.1 km +``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b @@ -203,6 +272,24 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
+```
+
+```text
+**Tokyo – Current Weather**
+
+- **Condition:** Mainly clear
+- **Temperature:** 11.7 °C
+- **Humidity:** 74 %
+- **Dew Point:** 7.2 °C
+- **Wind:** 4.2 km/h from the SSE, gusts up to 22.7 km/h
+- **Pressure:** 1018.0 hPa
+- **Cloud Cover:** 44 %
+- **Visibility:** 24.1 km
+
+Enjoy your day!
+```
 :::
 ::::

From e3d9c4bdf11b284497d9a97772d3cb5301983dfa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Rzepecki?=
Date: Mon, 9 Mar 2026 07:19:13 +0100
Subject: [PATCH 07/22] Apply suggestions from code review

Co-authored-by: Trawinski, Dariusz
---
 demos/continuous_batching/agentic_ai/README.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 21156fd078..269f17ac4c 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -22,19 +22,22 @@ docker build -t mcp-weather-server:sse .
 docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m mcp_weather_server --mode sse
 ```

-> **Note:** On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application
+### Windows
+On Windows, the MCP server is demonstrated as a stdio-based instance running inside the agent application.
+The file system MCP server requires Node.js and npx; visit https://nodejs.org/en/download. The weather MCP server should be installed as a Python package:
+```console
+pip install python-dateutil mcp_weather_server
+```

 ## Start the agent

 Install the application requirements

 ```console
-curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -o openai_agent.py
-pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/requirements.txt
+curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -O -L
+pip install openai-agents openai
 ```
 Make sure nodejs and npx are installed. On ubuntu it would require `sudo apt install nodejs npm`. On windows, visit https://nodejs.org/en/download. It is needed for the `file system` MCP server.

-For windows applications it may be required to set environmental variable to enforce utf-8 encodeing in python:
+On Windows it may be required to set an environment variable to enforce UTF-8 encoding in Python:

 ```bat
 set PYTHONUTF8=1
@@ -53,7 +56,7 @@ as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md),
 :::{tab-item} Qwen3-8B
 :sync: Qwen3-8B
 ```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true
+ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache
 ```
 :::
 :::{tab-item} Qwen3-4B

From 8fec692559e133233d3a84a1a257dc5d55b2a78d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Rzepecki?=
Date: Mon, 9 Mar 2026 07:21:14 +0100
Subject: [PATCH 08/22] Update demos/continuous_batching/agentic_ai/README.md

Co-authored-by: Trawinski, Dariusz
---
 demos/continuous_batching/agentic_ai/README.md | 1 -
 1 file changed, 1 deletion(-)

diff
--git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 269f17ac4c..5ff6234b6d 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -10,7 +10,6 @@ Here are presented required steps to deploy language models trained for tools su The application employing OpenAI agent SDK is using MCP server. It is equipped with a set of tools to providing context for the content generation. The tools can also be used for automation purposes based on input in text format. -> **Note:** On Windows, make sure to use the weekly or 2025.4 release packages for proper functionality. ## Start MCP server with SSE interface From 88af3d20352e00f203ffa038edcd1d30b829998e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Rzepecki?= Date: Mon, 9 Mar 2026 07:21:27 +0100 Subject: [PATCH 09/22] Update demos/continuous_batching/agentic_ai/README.md Co-authored-by: Trawinski, Dariusz --- demos/continuous_batching/agentic_ai/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 5ff6234b6d..e597fc4e27 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -34,7 +34,6 @@ Install the application requirements curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -O -L pip install openai-agents openai ``` -Make sure nodejs and npx are installed. On ubuntu it would require `sudo apt install nodejs npm`. On windows, visit https://nodejs.org/en/download. It is needed for the `file system` MCP server. 
On Windows it may be required to set an environment variable to enforce UTF-8 encoding in Python:

From dcc3dd4f7d813961a80abca9457d6d113e87f15f Mon Sep 17 00:00:00 2001
From: Pawel
Date: Fri, 6 Mar 2026 14:56:04 +0100
Subject: [PATCH 10/22] save

---
 .../continuous_batching/agentic_ai/README.md | 104 +++++++++++++++++-
 1 file changed, 100 insertions(+), 4 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index e597fc4e27..76079c8c74 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -53,42 +53,102 @@ as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md),
 ::::{tab-set}
 :::{tab-item} Qwen3-8B
 :sync: Qwen3-8B
+Pull and start OVMS:
 ```bat
 ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache
 ```
+
+Use MCP server:
+```bat
+python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
+```
+
+Exemplary output:
+```text
+The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The relative humidity is at 91%, and the dew point is 7.0°C. The wind is blowing from the SSE at 4.2 km/h, with gusts up to 15.5 km/h. The atmospheric pressure is 1016.0 hPa, with 72% cloud cover. Visibility is 24.1 km.
+``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B +Pull and start OVMS: ```bat ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct + +Use MCP server: ```bat -ovms.exe --rest_port 8000 --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +Exemplary output: +```text +The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The relative humidity is at 91%, and the dew point is 7.0°C. The wind is coming from the SSE at 4.2 km/h with gusts up to 15.5 km/h. The atmospheric pressure is 1016.0 hPa, with 72% cloud cover. Visibility is 24.1 km. ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct +Pull and start OVMS: ```bat ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 --enable_prefix_caching true ``` + +Use MCP server: +```bat +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required +``` + +Exemplary output: +```text +The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. 
The relative humidity is quite high at 91%, and the dew point is at 7.0°C, indicating that the air is moist. Winds are coming from the southeast at a gentle breeze of 4.2 km/h, with gusts reaching up to 15.5 km/h. The atmospheric pressure is steady at 1016.0 hPa, and cloud cover is at 72%. Visibility is excellent at 24.1 km, suggesting clear conditions for most outdoor activities. +``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct +Pull and start OVMS: ```bat set MOE_USE_MICRO_GEMM_PREFILL=0 ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` + +Use MCP server: +```bat +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +Exemplary output: +```text +The current weather in Tokyo is mainly clear with a temperature of 8.7°C. The relative humidity is at 89%, and the dew point is 6.9°C. The wind is blowing from the SSE at 5.0 km/h, with gusts reaching up to 22.0 km/h. The atmospheric pressure is 1014.4 hPa, and there is 34% cloud cover. The visibility is 24.1 km. +``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b +Pull and start OVMS: ```bat ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. 
+Use MCP server: +```bat +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +Exemplary output: +```text +**Tokyo – Current Weather** + +- **Condition:** Mainly clear +- **Temperature:** 8.7 °C +- **Humidity:** 89 % +- **Dew Point:** 6.9 °C +- **Wind:** SSE at 5 km/h (gusts up to 22 km/h) +- **Pressure:** 1014.4 hPa +- **Cloud Cover:** 34 % +- **Visibility:** 24.1 km + +Let me know if you’d like more details or a forecast! +``` + ::: :::: @@ -97,28 +157,34 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_re ::::{tab-set} :::{tab-item} Qwen3-8B :sync: Qwen3-8B +Pull and start OVMS: ```bat ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 ``` +Use MCP server: ```bat python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-cw-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` +Exemplary output: ```text The current weather in Tokyo is mainly clear with a temperature of 11.5°C. The relative humidity is at 82%, and the dew point is 8.5°C. The wind is blowing from the S at 6.8 km/h, with gusts up to 13.7 km/h. The atmospheric pressure is 1017.1 hPa, and there is 21% cloud cover. Visibility is 24.1 km. 
```
 :::
 :::{tab-item} Qwen3-4B
 :sync: Qwen3-4B
+Pull and start OVMS:
 ```bat
 ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
 ```

+Use MCP server:
 ```bat
 python openai_agent.py --query "What is the current weather in Tokyo?" --model FluidInference/qwen3-4b-int4-ov-npu --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
 ```

+Exemplary output:
 ```text
 The current weather in Tokyo is mainly clear, with a temperature of 11.5°C. The relative humidity is at 82%, and the dew point is at 8.5°C. There is a wind blowing from the south at 6.8 km/h, with gusts up to 13.7 km/h. The atmospheric pressure is 1017.1 hPa, and there is 21% cloud cover. The visibility is 24.1 km.
 ```
@@ -132,60 +198,72 @@ The current weather in Tokyo is mainly clear, with a temperature of 11.5°C. The

 ::::{tab-set}
 :::{tab-item} Qwen3-8B
 :sync: Qwen3-8B
+Pull and start OVMS:
 ```bash
 docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
 --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true
 ```

+Use MCP server:
 ```bash
 python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
 ```

+Exemplary output:
 ```text
 The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is -1.5°C. Wind is blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover, and visibility is 24.1 km.
``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B +Pull and start OVMS: ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` +Use MCP server: ```bash python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` +Exemplary output: ```text The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. Winds are coming from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover. Visibility is 24.1 km. ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct +Pull and start OVMS: ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` +Use MCP server: ```bash python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required ``` +Exemplary output: ```text The current weather in Tokyo is mostly clear with a temperature of 12.4°C. The relative humidity is at 68%, and the dew point is at 6.7°C. Winds are coming from the SSE at a speed of 5.3 km/h, with gusts reaching up to 25.2 km/h. The atmospheric pressure is 1017.9 hPa, and there is a 23% cloud cover. Visibility is good at 24.1 km. 
```
:::
:::{tab-item} Qwen3-Coder-30B-A3B-Instruct
:sync: Qwen3-Coder-30B-A3B-Instruct
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true
```

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
```

+Exemplary output:
```text
The current weather in Tokyo is as follows:

- **Condition**: Mainly clear
@@ -213,45 +291,54 @@ It can be applied using the commands below:
::::{tab-set}
:::{tab-item} Qwen3-8B
:sync: Qwen3-8B
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true
```

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
```

+Exemplary output:
```text
The current weather in Tokyo is mainly clear with a temperature of 11.7°C. The relative humidity is at 74%, and the dew point is 7.2°C. The wind is blowing from the southeast at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, and there is 44% cloud cover. Visibility is 24.1 km.
```
:::
:::{tab-item} Qwen3-4B
:sync: Qwen3-4B
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true
```

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
```

+Exemplary output:
```text
The current weather in Tokyo is mainly clear. The temperature is 11.7°C, with a relative humidity of 74% and a dew point of 7.2°C. The wind is coming from the SSE at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, with 44% cloud cover. Visibility is 24.1 km.
```
:::
:::{tab-item} Qwen3-Coder-30B-A3B-Instruct
:sync: Qwen3-Coder-30B-A3B-Instruct
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true
```

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
```

+Exemplary output:
```text
The current weather in Tokyo is as follows:

- **Condition**: Mainly clear
@@ -266,6 +353,7 @@ The current weather in Tokyo is as follows:
:::
:::{tab-item} gpt-oss-20b
:sync: gpt-oss-20b
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
--rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models \
@@ -273,10 +361,12 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model
```

> **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected.

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
```

+Exemplary output:
```text
**Tokyo – Current Weather**
@@ -303,30 +393,36 @@ It can be applied using the commands below:
::::{tab-set}
:::{tab-item} Qwen3-8B
:sync: Qwen3-8B
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \
--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000
```

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-cw-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
```

+Exemplary output:
```text
The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. The wind is blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover, and the visibility is 24.1 km.
```
:::
:::{tab-item} Qwen3-4B
:sync: Qwen3-4B
+Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
--rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000
```

+Use MCP server:
```bash
python openai_agent.py --query "What is the current weather in Tokyo?" --model FluidInference/qwen3-4b-int4-ov-npu --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
```

+Exemplary output:
```text
The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. There is a wind blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover. The visibility is 24.1 km.
```

From fd90dd485773d54aa1c90e5fa4bc704f85a78ce6 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Mon, 9 Mar 2026 07:31:41 +0100
Subject: [PATCH 11/22] remove default command

---
 .../continuous_batching/agentic_ai/README.md | 32 +++++++++----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 76079c8c74..9cff3dca13 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -72,7 +72,7 @@ The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The
:sync: Qwen3-4B
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true
+ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache
```

Use MCP server:
@@ -89,7 +89,7 @@ The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The
:sync: Phi-4-mini-instruct
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 --enable_prefix_caching true
+ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999
```

Use MCP server:
@@ -107,7 +107,7 @@ The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The
Pull and start OVMS:
```bat
set MOE_USE_MICRO_GEMM_PREFILL=0
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true
+ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache
```

Use MCP server:
@@ -124,7 +124,7 @@ The current weather in Tokyo is mainly clear with a temperature of 8.7°C. The r
:sync: gpt-oss-20b
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU
+ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --target_device GPU
```

> **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected.
@@ -159,7 +159,7 @@ Let me know if you’d like more details or a forecast!
:sync: Qwen3-8B
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
+ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 4000
```

Use MCP server:
@@ -176,7 +176,7 @@ The current weather in Tokyo is mainly clear with a temperature of 11.5°C. The
:sync: Qwen3-4B
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
+ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 4000
```

Use MCP server:
@@ -201,7 +201,7 @@ The current weather in Tokyo is mainly clear, with a temperature of 11.5°C. The
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation
```

Use MCP server:
@@ -219,7 +219,7 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation
```

Use MCP server:
@@ -237,7 +237,7 @@ The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser hermes3 --task text_generation
```

Use MCP server:
@@ -255,7 +255,7 @@ The current weather in Tokyo is mostly clear with a temperature of 12.4°C. The
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
---rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true
+--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation
```

Use MCP server:
@@ -294,7 +294,7 @@ It can be applied using the commands below:
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation
```

Use MCP server:
@@ -312,7 +312,7 @@ The current weather in Tokyo is mainly clear with a temperature of 11.7°C. The
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation
```

Use MCP server:
@@ -330,7 +330,7 @@ The current weather in Tokyo is mainly clear. The temperature is 11.7°C, with a
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
---rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true
+--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true
```

Use MCP server:
@@ -357,7 +357,7 @@ Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
--rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models \
---tool_parser gptoss --reasoning_parser gptoss --target_device GPU --task text_generation --enable_prefix_caching true
+--tool_parser gptoss --reasoning_parser gptoss --target_device GPU --task text_generation
```

> **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected.
@@ -396,7 +396,7 @@ It can be applied using the commands below:
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 4000
```

Use MCP server:
@@ -414,7 +414,7 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000
+--rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 4000
```

Use MCP server:

From de44c5c5bc0b27b9c9e0bc1611b7d3bb3888da93 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Mon, 9 Mar 2026 10:06:44 +0100
Subject: [PATCH 12/22] script corrected, doc fixes

---
 demos/continuous_batching/agentic_ai/README.md     | 14 ++++----------
 .../continuous_batching/agentic_ai/openai_agent.py | 14 ++++++++++++++
 .../agentic_ai/requirements.txt                    |  4 ----
 3 files changed, 18 insertions(+), 14 deletions(-)
 delete mode 100644 demos/continuous_batching/agentic_ai/requirements.txt

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 9cff3dca13..8b47b6e1af 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -35,12 +35,6 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/c
pip install openai-agents openai
```

-For windows applications it may be required to set environment variable to enforce utf-8 encodeing in python:
-
-```bat
-set PYTHONUTF8=1
-```
-
## Start OVMS

This deployment procedure assumes the model was pulled or exported using the procedure above. The exception are models from OpenVINO organization if they support tools correctly with the default template like "OpenVINO/Qwen3-8B-int4-ov" - they can be deployed in a single command pulling and staring the server.
@@ -159,7 +153,7 @@ Let me know if you’d like more details or a forecast!
:sync: Qwen3-8B
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 4000
+ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 8000
```

Use MCP server:
@@ -176,7 +170,7 @@ The current weather in Tokyo is mainly clear with a temperature of 11.5°C. The
:sync: Qwen3-4B
Pull and start OVMS:
```bat
-ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 4000
+ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 8000
```

Use MCP server:
@@ -396,7 +390,7 @@ It can be applied using the commands below:
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 4000
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 8000
```

Use MCP server:
@@ -414,7 +408,7 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 4000
+--rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 8000
```

Use MCP server:
diff --git a/demos/continuous_batching/agentic_ai/openai_agent.py b/demos/continuous_batching/agentic_ai/openai_agent.py
index 859d154b28..794c4715d4 100644
--- a/demos/continuous_batching/agentic_ai/openai_agent.py
+++ b/demos/continuous_batching/agentic_ai/openai_agent.py
@@ -19,6 +19,7 @@
import asyncio
import os
import platform
+import sys

from openai import AsyncOpenAI
from agents import Agent, Runner, RunConfig
@@ -37,6 +38,7 @@ )

API_KEY = "not_used"
+os.environ["PYTHONUTF8"] = "1"
env_proxy = {}
http_proxy = os.environ.get("http_proxy")
https_proxy = os.environ.get("https_proxy")
@@ -76,6 +78,18 @@ async def run(query, agent, OVMS_MODEL_PROVIDER, stream: bool = False):
    else:
        result = await Runner.run(starting_agent=agent, input=query, run_config=RunConfig(model_provider=OVMS_MODEL_PROVIDER, tracing_disabled=True))
    print(result.final_output)
+
+    is_tool_call_present = False
+
+    if hasattr(result, 'new_items') and result.new_items:
+        for item in result.new_items:
+            if hasattr(item, 'type') and item.type == "tool_call_item":
+                is_tool_call_present = True
+
+    if is_tool_call_present:
+        sys.exit(0)
+    else:
+        sys.exit(1)


if __name__ == "__main__":
diff --git a/demos/continuous_batching/agentic_ai/requirements.txt b/demos/continuous_batching/agentic_ai/requirements.txt
deleted file mode 100644
index 5147552b66..0000000000
--- a/demos/continuous_batching/agentic_ai/requirements.txt
+++ /dev/null
@@ -1,4 +0,0 @@
-openai-agents==0.2.11
-openai==1.107.0
-python-dateutil
-mcp_weather_server
\ No newline at end of file

From ab3473b7e84979102234ea546c4d0bf97682b3c0 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Mon, 9 Mar 2026 10:35:08 +0100
Subject: [PATCH 13/22] reverting mistral, adding tool_call_indicator to streaming

---
 .../continuous_batching/agentic_ai/README.md | 51 +++++++++++++++++++
 .../agentic_ai/openai_agent.py               | 22 ++++----
 2 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 8b47b6e1af..b77a8dcada 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -96,6 +96,23 @@ Exemplary output:
The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The relative humidity is quite high at 91%, and the dew point is at 7.0°C, indicating that the air is moist. Winds are coming from the southeast at a gentle breeze of 4.2 km/h, with gusts reaching up to 15.5 km/h. The atmospheric pressure is steady at 1016.0 hPa, and cloud cover is at 72%. Visibility is excellent at 24.1 km, suggesting clear conditions for most outdoor activities.
```
:::
+:::{tab-item} Mistral-7B-Instruct-v0.3
+:sync: Mistral-7B-Instruct-v0.3
+Pull and start OVMS:
+```bat
+ovms.exe --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation
+```
+
+Use MCP server:
+```bat
+python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
+```
+
+Examplary output:
+```text
+The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius.
+```
+:::
:::{tab-item} Qwen3-Coder-30B-A3B-Instruct
:sync: Qwen3-Coder-30B-A3B-Instruct
Pull and start OVMS:
@@ -244,6 +261,23 @@ Exemplary output:
The current weather in Tokyo is mostly clear with a temperature of 12.4°C. The relative humidity is at 68%, and the dew point is at 6.7°C. Winds are coming from the SSE at a speed of 5.3 km/h, with gusts reaching up to 25.2 km/h. The atmospheric pressure is 1017.9 hPa, and there is a 23% cloud cover. Visibility is good at 24.1 km.
```
:::
+:::{tab-item} Mistral-7B-Instruct-v0.3
+:sync: Mistral-7B-Instruct-v0.3
+Pull and start OVMS:
+```bash
+docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:latest --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation
+```
+
+Use MCP server:
+```bash
+python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
+```
+
+Examplary output:
+```text
+The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius.
+```
+:::
:::{tab-item} Qwen3-Coder-30B-A3B-Instruct
:sync: Qwen3-Coder-30B-A3B-Instruct
Pull and start OVMS:
@@ -319,6 +353,23 @@ Exemplary output:
The current weather in Tokyo is mainly clear. The temperature is 11.7°C, with a relative humidity of 74% and a dew point of 7.2°C. The wind is coming from the SSE at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, with 44% cloud cover. Visibility is 24.1 km.
```
:::
+:::{tab-item} Mistral-7B-Instruct-v0.3
+:sync: Mistral-7B-Instruct-v0.3
+Pull and start OVMS:
+```bash
+docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation
+```
+
+Use MCP server:
+```bash
+python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
+```
+
+Examplary output:
+```text
+The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius.
+```
+:::
:::{tab-item} Qwen3-Coder-30B-A3B-Instruct
:sync: Qwen3-Coder-30B-A3B-Instruct
Pull and start OVMS:
diff --git a/demos/continuous_batching/agentic_ai/openai_agent.py b/demos/continuous_batching/agentic_ai/openai_agent.py
index 794c4715d4..0ee59d6984 100644
--- a/demos/continuous_batching/agentic_ai/openai_agent.py
+++ b/demos/continuous_batching/agentic_ai/openai_agent.py
@@ -49,6 +49,13 @@
RunConfig.tracing_disabled = False # Disable tracing for this example

+def check_if_tool_calls_present(result) -> bool:
+    if hasattr(result, 'new_items') and result.new_items:
+        for item in result.new_items:
+            if hasattr(item, 'type') and item.type == "tool_call_item":
+                return True
+    return False
+
async def run(query, agent, OVMS_MODEL_PROVIDER, stream: bool = False):
    for server in agent.mcp_servers:
        await server.connect()
@@ -79,17 +86,12 @@ async def run(query, agent, OVMS_MODEL_PROVIDER, stream: bool = False):
        result = await Runner.run(starting_agent=agent, input=query, run_config=RunConfig(model_provider=OVMS_MODEL_PROVIDER, tracing_disabled=True))
    print(result.final_output)

-    is_tool_call_present = False
+    is_tool_call_present = check_if_tool_calls_present(result)

-    if hasattr(result, 'new_items') and result.new_items:
-        for item in result.new_items:
-            if hasattr(item, 'type') and item.type == "tool_call_item":
-                is_tool_call_present = True
-
-    if is_tool_call_present:
-        sys.exit(0)
-    else:
-        sys.exit(1)
+    if is_tool_call_present:
+        sys.exit(0)
+    else:
+        sys.exit(1)


if __name__ == "__main__":

From eb5f579ef679fd3a7c82b639a1e59a79e5a906b1 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Mon, 9 Mar 2026 10:42:55 +0100
Subject: [PATCH 14/22] typos

---
 demos/continuous_batching/agentic_ai/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index b77a8dcada..ee258f8137 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -108,7 +108,7 @@ Use MCP server:
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
```

-Examplary output:
+Exemplary output:
```text
The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius.
```
@@ -273,7 +273,7 @@ Use MCP server:
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
```

-Examplary output:
+Exemplary output:
```text
The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius.
```
@@ -365,7 +365,7 @@ Use MCP server:
python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream
```

-Examplary output:
+Exemplary output:
```text
The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius.
```

From d2f45150e6e2a7873c81677da8f0a3d41bfb87c4 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Mon, 9 Mar 2026 12:03:51 +0100
Subject: [PATCH 15/22] fixes

---
 demos/continuous_batching/agentic_ai/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index ee258f8137..c071402acc 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -248,7 +248,7 @@ The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5
Pull and start OVMS:
```bash
docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \
---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser hermes3 --task text_generation
+--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation
```

Use MCP server:
@@ -508,7 +508,7 @@ Models can be also compared using the [leaderboard reports](https://gorilla.cs.b
### Export using python script

-Use those steps to convert the model from HugginFace Hub to OpenVINO format and export it to a local storage.
+Use those steps to convert the model from HuggingFace Hub to OpenVINO format and export it to a local storage.
```text # Download export script, install its dependencies and create directory for the models From 210ce526b2511de8c71e0eb7fa487eff7506db8e Mon Sep 17 00:00:00 2001 From: Pawel Date: Mon, 9 Mar 2026 12:11:23 +0100 Subject: [PATCH 16/22] fix --- .../agentic_ai/openai_agent.py | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/openai_agent.py b/demos/continuous_batching/agentic_ai/openai_agent.py index 0ee59d6984..bd7b0aee41 100644 --- a/demos/continuous_batching/agentic_ai/openai_agent.py +++ b/demos/continuous_batching/agentic_ai/openai_agent.py @@ -47,7 +47,7 @@ if https_proxy: env_proxy["https_proxy"] = https_proxy -RunConfig.tracing_disabled = False # Disable tracing for this example +RunConfig.tracing_disabled = False # Enable tracing for this example def check_if_tool_calls_present(result) -> bool: if hasattr(result, 'new_items') and result.new_items: @@ -86,12 +86,7 @@ async def run(query, agent, OVMS_MODEL_PROVIDER, stream: bool = False): result = await Runner.run(starting_agent=agent, input=query, run_config=RunConfig(model_provider=OVMS_MODEL_PROVIDER, tracing_disabled=True)) print(result.final_output) - is_tool_call_present = check_if_tool_calls_present(result) - - if is_tool_call_present: - sys.exit(0) - else: - sys.exit(1) + return check_if_tool_calls_present(result) if __name__ == "__main__": @@ -142,4 +137,9 @@ def get_model(self, _) -> Model: model_settings=ModelSettings(tool_choice=args.tool_choice, temperature=0.0, max_tokens=1000, extra_body={"chat_template_kwargs": {"enable_thinking": args.enable_thinking}}), ) loop = asyncio.new_event_loop() - loop.run_until_complete(run(args.query, agent, OVMS_MODEL_PROVIDER, args.stream)) + + is_tool_call_present = loop.run_until_complete(run(args.query, agent, OVMS_MODEL_PROVIDER, args.stream)) + if is_tool_call_present: + sys.exit(0) + else: + sys.exit(1) \ No newline at end of file From 460f172bb91e804d5363fd74a0e787c57aebdfd4 Mon Sep 17 
00:00:00 2001 From: Dariusz Trawinski Date: Tue, 10 Mar 2026 15:38:24 +0100 Subject: [PATCH 17/22] fix --- demos/continuous_batching/agentic_ai/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index c071402acc..17074a422b 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -24,7 +24,9 @@ docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m ### Windows On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application. File system MCP server requires NodeJS and npx, visit https://nodejs.org/en/download. The weather MCP should be installed as python package: -```pip install python-dateutil mcp_weather_server``` +```bat +pip install python-dateutil mcp_weather_server +``` ## Start the agent From 9d40e5d03553466a8230f9e9305d7c899cdcaa01 Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 11 Mar 2026 08:24:05 +0100 Subject: [PATCH 18/22] save --- demos/continuous_batching/agentic_ai/README.md | 8 ++++++++ demos/continuous_batching/agentic_ai/openai_agent.py | 2 ++ 2 files changed, 10 insertions(+) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 17074a422b..0c89767842 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -480,6 +480,14 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li > **Note:** For more interactive mode you can run the application with streaming enabled by providing `--stream` parameter to the script. 
+### Using Llama index + +Pull and start OVMS: +```bash +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation +``` + You can try also similar implementation based on llama_index library working the same way: ```bash pip install llama-index-llms-openai-like==0.5.3 llama-index-core==0.14.5 llama-index-tools-mcp==0.4.2 diff --git a/demos/continuous_batching/agentic_ai/openai_agent.py b/demos/continuous_batching/agentic_ai/openai_agent.py index bd7b0aee41..25db857cf1 100644 --- a/demos/continuous_batching/agentic_ai/openai_agent.py +++ b/demos/continuous_batching/agentic_ai/openai_agent.py @@ -139,6 +139,8 @@ def get_model(self, _) -> Model: loop = asyncio.new_event_loop() is_tool_call_present = loop.run_until_complete(run(args.query, agent, OVMS_MODEL_PROVIDER, args.stream)) + + # for testing purposes, exit codes are dependent on whether a tool call was present in the agent's reasoning process if is_tool_call_present: sys.exit(0) else: From d5f8846d9046c726d9ab59bac51e1bded82e5f85 Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 11 Mar 2026 08:42:40 +0100 Subject: [PATCH 19/22] fix --- demos/continuous_batching/agentic_ai/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 0c89767842..e3a42b271d 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -35,6 +35,7 @@ Install the application requirements ```console curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -O -L pip install openai-agents openai +mkdir models ``` ## Start OVMS From b9b928d95d50b44209644ae2aa0c8b75ca96e2bb Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 11 Mar 2026 
10:33:53 +0100 Subject: [PATCH 20/22] delete mistral --- .../continuous_batching/agentic_ai/README.md | 51 ------------------- 1 file changed, 51 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index e3a42b271d..4648d95e02 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -99,23 +99,6 @@ Exemplary output: The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The relative humidity is quite high at 91%, and the dew point is at 7.0°C, indicating that the air is moist. Winds are coming from the southeast at a gentle breeze of 4.2 km/h, with gusts reaching up to 15.5 km/h. The atmospheric pressure is steady at 1016.0 hPa, and cloud cover is at 72%. Visibility is excellent at 24.1 km, suggesting clear conditions for most outdoor activities. ``` ::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -Pull and start OVMS: -```bat -ovms.exe --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation -``` - -Use MCP server: -```bat -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream -``` - -Exemplary output: -```text -The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius. -``` -::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct Pull and start OVMS: @@ -264,23 +247,6 @@ Exemplary output: The current weather in Tokyo is mostly clear with a temperature of 12.4°C. The relative humidity is at 68%, and the dew point is at 6.7°C. Winds are coming from the SSE at a speed of 5.3 km/h, with gusts reaching up to 25.2 km/h. 
The atmospheric pressure is 1017.9 hPa, and there is a 23% cloud cover. Visibility is good at 24.1 km. ``` ::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -Pull and start OVMS: -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:latest --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation -``` - -Use MCP server: -```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream -``` - -Exemplary output: -```text -The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius. -``` -::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct Pull and start OVMS: @@ -356,23 +322,6 @@ Exemplary output: The current weather in Tokyo is mainly clear. The temperature is 11.7°C, with a relative humidity of 74% and a dew point of 7.2°C. The wind is coming from the SSE at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, with 44% cloud cover. Visibility is 24.1 km. ``` ::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -Pull and start OVMS: -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation -``` - -Use MCP server: -```bash -python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream -``` - -Exemplary output: -```text -The current weather in Tokyo on March 2, 2026 is Partly cloudy with a temperature of 9 degrees Celsius. -``` -::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct Pull and start OVMS: From 509b3fa568eb1888da910d4b7c633bcaa493861a Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Thu, 12 Mar 2026 14:43:04 +0100 Subject: [PATCH 21/22] model repository changes --- .../continuous_batching/agentic_ai/README.md | 90 ++++++++++--------- 1 file changed, 50 insertions(+), 40 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 4648d95e02..8ea1b49273 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -35,7 +35,6 @@ Install the application requirements ```console curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -O -L pip install openai-agents openai -mkdir models ``` ## Start OVMS @@ -52,7 +51,7 @@ as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), :sync: Qwen3-8B Pull and start OVMS: ```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path c:\models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache ``` Use MCP server: @@ -69,7 +68,7 @@ The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. 
The :sync: Qwen3-4B Pull and start OVMS: ```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path c:\models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache ``` Use MCP server: @@ -86,7 +85,7 @@ The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The :sync: Phi-4-mini-instruct Pull and start OVMS: ```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 +ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path c:\models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 ``` Use MCP server: @@ -99,17 +98,17 @@ Exemplary output: The current weather in Tokyo is partly cloudy with a temperature of 8.4°C. The relative humidity is quite high at 91%, and the dew point is at 7.0°C, indicating that the air is moist. Winds are coming from the southeast at a gentle breeze of 4.2 km/h, with gusts reaching up to 15.5 km/h. The atmospheric pressure is steady at 1016.0 hPa, and cloud cover is at 72%. Visibility is excellent at 24.1 km, suggesting clear conditions for most outdoor activities. 
``` ::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct +:::{tab-item} Qwen3-30B-A3B-Instruct-2507 +:sync: Qwen3-30B-A3B-Instruct-2507 Pull and start OVMS: ```bat set MOE_USE_MICRO_GEMM_PREFILL=0 -ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --model_repository_path c:\models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache ``` Use MCP server: ```bat -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` Exemplary output: @@ -121,7 +120,7 @@ The current weather in Tokyo is mainly clear with a temperature of 8.7°C. The r :sync: gpt-oss-20b Pull and start OVMS: ```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --target_device GPU +ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path c:\models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --target_device GPU ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. 
@@ -156,7 +155,7 @@ Let me know if you’d like more details or a forecast! :sync: Qwen3-8B Pull and start OVMS: ```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 8000 +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path c:\models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 8000 ``` Use MCP server: @@ -173,7 +172,7 @@ The current weather in Tokyo is mainly clear with a temperature of 11.5°C. The :sync: Qwen3-4B Pull and start OVMS: ```bat -ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 8000 +ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path c:\models --tool_parser hermes3 --target_device NPU --task text_generation --cache_dir .cache --max_prompt_len 8000 ``` Use MCP server: @@ -197,8 +196,9 @@ The current weather in Tokyo is mainly clear, with a temperature of 11.5°C. 
The :sync: Qwen3-8B Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path ${HOME}/models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation ``` Use MCP server: @@ -215,7 +215,8 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li :sync: Qwen3-4B Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation ``` @@ -233,8 +234,9 @@ The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5 :sync: Phi-4-mini-instruct Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation ``` Use MCP server: @@ -247,17 +249,18 @@ Exemplary output: The current weather in Tokyo is mostly clear with a temperature of 12.4°C. 
The relative humidity is at 68%, and the dew point is at 6.7°C. Winds are coming from the SSE at a speed of 5.3 km/h, with gusts reaching up to 25.2 km/h. The atmospheric pressure is 1017.9 hPa, and there is a 23% cloud cover. Visibility is good at 24.1 km. ``` ::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct +:::{tab-item} Qwen3-30B-A3B-Instruct-2507 +:sync: Qwen3-30B-A3B-Instruct-2507 Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ +--rest_port 8000 --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --model_repository_path /models --tool_parser hermes3 --task text_generation ``` Use MCP server: ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` Exemplary output: @@ -290,8 +293,9 @@ It can be applied using the commands below: :sync: Qwen3-8B Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation ``` Use MCP server: @@ -308,8 +312,9 @@ The current weather in Tokyo is mainly clear with a temperature of 11.7°C. The :sync: Qwen3-4B Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation ``` Use MCP server: @@ -322,17 +327,18 @@ Exemplary output: The current weather in Tokyo is mainly clear. 
The temperature is 11.7°C, with a relative humidity of 74% and a dew point of 7.2°C. The wind is coming from the SSE at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, with 44% cloud cover. Visibility is 24.1 km. ``` ::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct +:::{tab-item} Qwen3-30B-A3B-Instruct-2507 +:sync: Qwen3-30B-A3B-Instruct-2507 Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v ${HOME}/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ +--rest_port 8000 --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --model_repository_path /models --tool_parser hermes3 --target_device GPU --task text_generation --enable_tool_guided_generation true ``` Use MCP server: ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` Exemplary output: @@ -352,8 +358,9 @@ The current weather in Tokyo is as follows: :sync: gpt-oss-20b Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models \ +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ +--rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path /models \ --tool_parser gptoss --reasoning_parser gptoss --target_device GPU --task text_generation ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. 
@@ -392,8 +399,9 @@ It can be applied using the commands below: :sync: Qwen3-8B Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 8000 +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 8000 ``` Use MCP server: @@ -410,8 +418,9 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li :sync: Qwen3-4B Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 8000 +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path /models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --max_prompt_len 8000 ``` Use MCP server: @@ -430,15 +439,16 @@ The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels li > **Note:** For more interactive mode you can run the 
application with streaming enabled by providing `--stream` parameter to the script. -### Using Llama index +### Using Llama index agentic framework Pull and start OVMS: ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation +mkdir -p ${HOME}/models +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation ``` -You can try also similar implementation based on llama_index library working the same way: +You can also try a similar implementation based on the llama_index library, which works the same way as openai-agent: ```bash pip install llama-index-llms-openai-like==0.5.3 llama-index-core==0.14.5 llama-index-tools-mcp==0.4.2 curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/llama_index_agent.py -o llama_index_agent.py From 216d61f4b1ecd41a4e043d5af14fb30daf704090 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Thu, 12 Mar 2026 15:13:45 +0100 Subject: [PATCH 22/22] fix --- demos/continuous_batching/agentic_ai/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 8ea1b49273..359b2cadee 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -198,7 +198,7 @@ Pull and start OVMS: ```bash mkdir -p ${HOME}/models docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path ${HOME}/models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 
--task text_generation +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation ``` Use MCP server: @@ -217,7 +217,7 @@ Pull and start OVMS: ```bash mkdir -p ${HOME}/models docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v ${HOME}/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation +--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation ``` Use MCP server: