11 changes: 3 additions & 8 deletions demos/README.md
@@ -5,16 +5,13 @@
maxdepth: 1
hidden:
---
ovms_demos_continuous_batching_agent
ovms_demos_continuous_batching
ovms_demos_integration_with_open_webui
ovms_demos_code_completion_vsc
ovms_demos_audio
ovms_demos_rerank
ovms_demos_embeddings
ovms_demos_continuous_batching
ovms_demo_long_context
ovms_demos_continuous_batching_vlm
ovms_demos_llm_npu
ovms_demos_vlm_npu
ovms_demos_code_completion_vsc
ovms_demos_image_generation
ovms_demo_clip_image_classification
ovms_demo_age_gender_guide
@@ -40,10 +37,8 @@ ovms_demo_real_time_stream_analysis
ovms_demo_using_paddlepaddle_model
ovms_demo_bert
ovms_demo_universal-sentence-encoder
ovms_demo_benchmark_client
ovms_string_output_model_demo
ovms_demos_gguf
ovms_demos_audio

```

337 changes: 114 additions & 223 deletions demos/continuous_batching/README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion demos/continuous_batching/speculative_decoding/README.md
@@ -1,4 +1,4 @@
# How to serve LLM Models in Speculative Decoding Pipeline{#ovms_demos_continuous_batching_speculative_decoding}
# LLM Models in Speculative Decoding Pipeline{#ovms_demos_continuous_batching_speculative_decoding}

Following [OpenVINO GenAI docs](https://docs.openvino.ai/2026/openvino-workflow-generative/inference-with-genai.html#efficient-text-generation-via-speculative-decoding):
> Speculative decoding (or assisted-generation) enables faster token generation when an additional smaller draft model is used alongside the main model. This reduces the number of infer requests to the main model, increasing performance.
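Speculative decoding runs entirely server-side (main model plus draft model), so the client sends an ordinary OpenAI-style request. A minimal sketch of building such a request body; the endpoint URL and model name below are hypothetical placeholders, not values from this demo:

```python
import json

# Hypothetical endpoint and model name -- substitute the ones from your deployment.
ENDPOINT = "http://localhost:8000/v3/chat/completions"

def build_chat_request(model: str, user_prompt: str, max_tokens: int = 128) -> str:
    """Serialize an OpenAI-style chat completions payload. Because the draft
    model is configured on the server, this request is identical to a
    regular generation request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return json.dumps(payload)

body = build_chat_request("meta-llama/CodeLlama-7b-hf", "def quicksort(arr):")
```

The serialized `body` can then be POSTed to the server's chat completions endpoint with any HTTP client.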
205 changes: 36 additions & 169 deletions demos/continuous_batching/vlm/README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion demos/embeddings/README.md
@@ -1,4 +1,4 @@
# How to serve Embeddings models via OpenAI API {#ovms_demos_embeddings}
# Text Embeddings models via OpenAI API {#ovms_demos_embeddings}
This demo shows how to deploy embeddings models in the OpenVINO Model Server for text feature extraction.
The embeddings use case is exposed via the OpenAI API `embeddings` endpoint.
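A minimal sketch of an OpenAI-style `embeddings` request body; the model name is a hypothetical placeholder, not one prescribed by this demo:

```python
import json

def build_embeddings_request(model: str, texts: list[str]) -> str:
    """Serialize an OpenAI-style embeddings payload: a served model name
    plus one or more input strings to embed."""
    return json.dumps({"model": model, "input": texts})

body = build_embeddings_request(
    "Alibaba-NLP/gte-large-en-v1.5",  # hypothetical model name
    ["OpenVINO Model Server", "serves embeddings"],
)
```

The response, per the OpenAI API convention, carries one embedding vector per input string.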

6 changes: 4 additions & 2 deletions demos/integration_with_OpenWebUI/README.md
@@ -1,4 +1,4 @@
# Demonstrating integration of Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}
# Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}

## Description

@@ -70,7 +70,9 @@ Go to [http://localhost:8080](http://localhost:8080) and create admin account to

![get started with Open WebUI](./get_started_with_Open_WebUI.png)

### Reference
> **Important Note**: When using an NPU device for acceleration, or the gpt-oss-20b model on GPU, it is recommended to disable `Follow-Up Auto-Generation` in the `Settings > Interface` menu. This improves response time and avoids queuing requests. For the gpt-oss model it also avoids concurrent execution, which has an accuracy issue in version 2026.0.
### References
[https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html](https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html#model-preparation)

[https://docs.openwebui.com](https://docs.openwebui.com/#installation-with-pip)
Expand Down
4 changes: 2 additions & 2 deletions demos/rerank/README.md
@@ -1,8 +1,8 @@
# How to serve Rerank models via Cohere API {#ovms_demos_rerank}
# Documents Reranking via Cohere API {#ovms_demos_rerank}

## Prerequisites

**Model preparation**: Python 3.9 or higher with pip
**Model preparation**: Python 3.10 or higher with pip

**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
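A minimal sketch of a Cohere-style `rerank` request body; the model name is a hypothetical placeholder, not one prescribed by this demo:

```python
import json

def build_rerank_request(model: str, query: str,
                         documents: list[str], top_n: int = 3) -> str:
    """Serialize a Cohere-style rerank payload: the query is scored against
    each candidate document and the top_n best matches are returned."""
    payload = {
        "model": model,
        "query": query,
        "documents": documents,
        "top_n": min(top_n, len(documents)),  # never ask for more than we sent
    }
    return json.dumps(payload)

body = build_rerank_request(
    "BAAI/bge-reranker-large",  # hypothetical model name
    "What is OpenVINO?",
    ["OpenVINO is an inference toolkit.", "Bananas are yellow."],
)
```

The server responds with relevance scores per document, sorted so that retrieval pipelines can keep only the best matches.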

6 changes: 2 additions & 4 deletions demos/vlm_npu/README.md
@@ -1,4 +1,4 @@
# Serving for Text generation with Visual Language Models with NPU acceleration {#ovms_demos_vlm_npu}
# NPU for Visual Language Models {#ovms_demos_vlm_npu}


This demo shows how to deploy VLM models in the OpenVINO Model Server with NPU acceleration.
@@ -11,9 +11,7 @@ It is targeted on client machines equipped with NPU accelerator.

## Prerequisites

**OVMS 2025.1 or higher**

**Model preparation**: Python 3.9 or higher with pip and HuggingFace account
**Model preparation**: Python 3.10 or higher with pip and HuggingFace account

**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
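A VLM request pairs a text question with an image in an OpenAI-style chat payload, with the image embedded as a base64 data URI. A minimal sketch under that assumption; the model name and image bytes below are hypothetical placeholders:

```python
import base64
import json

def build_vlm_request(model: str, question: str, image_bytes: bytes) -> str:
    """Serialize an OpenAI-style chat payload combining a text part and an
    image part, with the image inlined as a base64 data URI."""
    data_uri = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        "max_tokens": 128,
    }
    return json.dumps(payload)

# Hypothetical model name; image_bytes would normally come from reading a file.
body = build_vlm_request("OpenGVLab/InternVL2-2B", "What is in this picture?", b"\xff\xd8fake")
```

In practice `image_bytes` is read from disk (e.g. `open("photo.jpg", "rb").read()`) before being passed in.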
