Hi,
I'm hitting a bug when trying to use Qwen3-Embedding models (Echo9Zulu/Qwen3-Embedding-0.6B-int8_asym-ov).
What happens
- On GPU: immediate crash with
  Got unexpected inputs: expecting {'input_ids', 'attention_mask', 'position_ids'} but got {'input_ids', 'attention_mask'}
- On CPU: the forward pass actually runs, but everything comes back as NaN, then the server blows up with
  ValueError: Out of range float values are not JSON compliant: nan
Interestingly, the Qwen3-Reranker (same model family, same IR input shape) works perfectly fine because it goes through optimum_rr.py which uses OVModelForCausalLM.
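For context on the CPU symptom: that ValueError is what Python's standard json module raises when it is asked to serialize NaN with allow_nan=False, which, as far as I know, is how Starlette's JSONResponse (and so FastAPI) encodes response bodies. The sketch below uses only the standard library to reproduce the failure and shows a guard that could turn it into a clearer error. `find_non_finite` is a hypothetical helper for illustration, not something that exists in OpenArc.

```python
import json
import math


def find_non_finite(vector):
    """Return the indices of NaN/inf entries in a flat list of floats."""
    return [i for i, x in enumerate(vector) if not math.isfinite(x)]


# Stand-in for a broken model output; real embeddings would come from the model.
embedding = [0.12, float("nan"), -0.5]

try:
    # Strict JSON has no NaN literal, so allow_nan=False refuses to encode it.
    json.dumps({"embedding": embedding}, allow_nan=False)
    strict_json_ok = True
except ValueError:
    strict_json_ok = False

# A guard like this could run before the response is built.
bad_indices = find_non_finite(embedding)
```

Running a check like this before building the response would make the failure point at the model output rather than at the JSON encoder.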
Environment
- Windows 11 Pro
- Intel Core Ultra 9 285 + Arc iGPU
- Python 3.12
- OpenArc commit: 8856d1d9c2b8a04c1a03143ed0c633a9ebf40987
- openvino: 2026.2.0-21876-6b2466c964b
- optimum-intel: 1.27.0.dev0+c877c15
Quick reproduction
openarc serve start --host 127.0.0.1
Then load with (set "device" to "CPU" or "GPU"):
{
  "model_name": "qwen3-emb-test",
  "model_path": "C:\\path\\to\\Qwen3-Embedding-0.6B-int8_asym-ov",
  "model_type": "emb",
  "engine": "optimum",
  "device": "CPU"
}
Hit /v1/embeddings → either the NaN error (CPU) or the input mismatch error (GPU).
I already confirmed the IR really does expose position_ids as a required input (3 inputs total).
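For anyone who wants to repeat that check, here is a rough sketch of how the IR inputs can be listed. It assumes the `openvino` Python package is installed. The function names (`collect_input_names`, `read_ir_input_names`, `missing_inputs`) are made up for this example and are not part of OpenArc or optimum-intel.

```python
def collect_input_names(model):
    """Gather every tensor name exposed by the model's input ports."""
    names = set()
    for port in model.inputs:
        names.update(port.get_names())
    return names


def read_ir_input_names(xml_path):
    """Read an OpenVINO IR from disk and return the set of its input names."""
    # Imported lazily so the other helpers work without openvino installed.
    import openvino as ov

    return collect_input_names(ov.Core().read_model(xml_path))


def missing_inputs(expected, provided):
    """Inputs the IR expects that the caller did not supply, sorted by name."""
    return sorted(set(expected) - set(provided))
```

Calling `read_ir_input_names` on the exported openvino_model.xml should return the three names from the GPU error message, and `missing_inputs` against what the tokenizer provides should then report position_ids.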
Root cause (as I understand it)
In src/engine/optimum/optimum_emb.py:84 it does:
self.model = OVModelForFeatureExtraction.from_pretrained(...)
OVModelForFeatureExtraction is designed for encoder-only models (BERT-style), where position_ids are baked into the IR. Decoder-style embedding models like Qwen3-Embedding (causal LM + last-token pooling + RoPE) export position_ids as an explicit input.
The reranker path avoids this because OVModelForCausalLM automatically builds position_ids from the attention mask. That's why the reranker works but the embedding path doesn't.
Current fix I'm trying
In generate_embeddings, right before calling the model, add a small check:
import torch

batch_dict = self.tokenizer(...)  # unchanged

# Add position_ids if the IR expects it
expected = {name for inp in self.model.model.inputs for name in inp.get_names()}
if "position_ids" in expected and "position_ids" not in batch_dict:
    attn = batch_dict["attention_mask"]
    batch_dict["position_ids"] = (attn.long().cumsum(-1) - 1).clamp(min=0) * attn

outputs = self.model(**batch_dict)
As far as I can tell, this mirrors how the causal-LM path derives positions from the attention mask (cumulative sum minus one), and it produces valid embeddings when I test it with raw OpenVINO inference.
A cleaner long-term fix would be to detect decoder-style embeddings and use OVModelForCausalLM + output_hidden_states=True instead, but the 4-line patch above is low-risk and fixes it immediately.
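To make the arithmetic in the patch concrete, here is a dependency-free sketch of the same cumsum / clamp / mask steps on plain Python lists. `position_ids_from_mask` is a hypothetical name used only for illustration; the actual patch operates on torch tensors.

```python
def position_ids_from_mask(attention_mask):
    """Per row: running count of real tokens minus one, forced to zero on padding."""
    result = []
    for row in attention_mask:
        seen = 0  # running sum of the mask, i.e. cumsum(-1)
        positions = []
        for keep in row:
            seen += keep
            # (cumsum - 1) clamped at 0, then multiplied by the mask value
            positions.append(max(seen - 1, 0) * keep)
        result.append(positions)
    return result


right_padded = position_ids_from_mask([[1, 1, 1, 0, 0]])  # [[0, 1, 2, 0, 0]]
left_padded = position_ids_from_mask([[0, 0, 1, 1, 1]])   # [[0, 0, 0, 1, 2]]
```

In both cases the real tokens are numbered 0, 1, 2 and padded slots get 0, so the result does not depend on which side the tokenizer pads.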
I confirm this is not a model conversion issue
- The exported openvino_model.xml / .bin are fine.
- Same IR works perfectly when I manually supply position_ids via raw openvino.Core.
- The reranker has the exact same 3-input signature and works today.
Small secondary bug
On GPU, the real error gets logged in openarc.log, but the client request just hangs until timeout. The FastAPI exception handler doesn't seem to propagate it properly for the embeddings endpoint. Not critical, but worth fixing later.
--
Hope someone can take a look. Happy to provide more logs or test any patch.
Thanks!
P