Qwen3-Embedding fails on optimum/emb path — OVModelForFeatureExtraction doesn't provide position_ids #106

@WinePaster

Hi,
I'm hitting a bug when trying to use Qwen3-Embedding models (Echo9Zulu/Qwen3-Embedding-0.6B-int8_asym-ov).

What happens

  • On GPU: immediate crash with
    Got unexpected inputs: expecting {'input_ids', 'attention_mask', 'position_ids'} but got {'input_ids', 'attention_mask'}

  • On CPU: the forward pass actually runs, but everything comes back as NaN, then the server blows up with
    ValueError: Out of range float values are not JSON compliant: nan
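The CPU failure can be reproduced in isolation. This is a minimal sketch, assuming the server serializes responses with `json.dumps(..., allow_nan=False)` (which is what Starlette's `JSONResponse` does); `find_non_finite` and `to_strict_json` are hypothetical helpers, not OpenArc functions.

```python
import json
import math


def find_non_finite(vector):
    """Return the indices of NaN or infinite values in an embedding vector."""
    return [i for i, v in enumerate(vector) if not math.isfinite(v)]


def to_strict_json(payload):
    """Serialize with allow_nan=False, so NaN/inf raise instead of producing invalid JSON."""
    return json.dumps(payload, allow_nan=False)


embedding = [0.12, float("nan"), -0.5]  # stand-in for a NaN-poisoned model output

try:
    to_strict_json({"embedding": embedding})
except ValueError as exc:
    print(f"serialization failed: {exc}")

print(find_non_finite(embedding))
```

A check like `find_non_finite` before building the response would at least turn the crash into a clear "model returned NaN" error.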

Interestingly, the Qwen3-Reranker (same model family, same IR input shape) works perfectly fine because it goes through optimum_rr.py which uses OVModelForCausalLM.

Environment

  • Windows 11 Pro
  • Intel Core Ultra 9 285 + Arc iGPU
  • Python 3.12
  • OpenArc commit: 8856d1d9c2b8a04c1a03143ed0c633a9ebf40987
  • openvino: 2026.2.0-21876-6b2466c964b
  • optimum-intel: 1.27.0.dev0+c877c15

Quick reproduction

openarc serve start --host 127.0.0.1

Then load the model with the following payload (set "device" to "GPU" to reproduce the GPU failure):

{
  "model_name": "qwen3-emb-test",
  "model_path": "C:\\path\\to\\Qwen3-Embedding-0.6B-int8_asym-ov",
  "model_type": "emb",
  "engine": "optimum",
  "device": "CPU"
}

Then hit /v1/embeddings: on CPU the embeddings are NaN and the request fails with the ValueError above, and on GPU it fails with the input mismatch error.

I already confirmed the IR really does expose position_ids as a required input (3 inputs total).
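For anyone who wants to repeat that check, here is a rough sketch of how the IR inputs can be listed. The helper names are my own (hypothetical); the only OpenVINO calls used are `Core().read_model`, `model.inputs`, and `get_names()`, the same accessors the patch further down relies on.

```python
def input_names(model):
    """Collect every tensor name exposed by the model's inputs.

    Works with any object that has an `inputs` iterable whose items
    implement `get_names()`, as openvino.Model inputs do.
    """
    return {name for port in model.inputs for name in port.get_names()}


def missing_inputs(model, provided):
    """List inputs (by their names) for which none of the names was provided."""
    provided = set(provided)
    return [
        sorted(port.get_names())
        for port in model.inputs
        if not (set(port.get_names()) & provided)
    ]


def read_ir_input_names(xml_path):
    """Read an IR from disk and return its input names (requires openvino)."""
    import openvino as ov  # lazy import: the helpers above have no hard dependency

    return input_names(ov.Core().read_model(xml_path))
```

Calling `read_ir_input_names` on the exported `openvino_model.xml` should list `position_ids` alongside `input_ids` and `attention_mask`, and `missing_inputs(model, batch_dict.keys())` shows what the tokenizer output lacks.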

Root cause (as I understand it)

src/engine/optimum/optimum_emb.py:84 loads the model with:

self.model = OVModelForFeatureExtraction.from_pretrained(...)

OVModelForFeatureExtraction is designed for encoder-only models (BERT-style), where position_ids are baked into the IR. Decoder-style embedding models like Qwen3-Embedding (causal LM + last-token pooling + RoPE) export position_ids as an explicit input.

The reranker path avoids this because OVModelForCausalLM automatically builds position_ids from the attention mask. That's why the reranker works but the embedding path doesn't.

Current fix I'm trying

In generate_embeddings, right before calling the model, add a small check:

import torch

batch_dict = self.tokenizer(...)  # unchanged

# Add position_ids if the IR expects it
expected = {name for inp in self.model.model.inputs for name in inp.get_names()}
if "position_ids" in expected and "position_ids" not in batch_dict:
    attn = batch_dict["attention_mask"]
    batch_dict["position_ids"] = (attn.long().cumsum(-1) - 1).clamp(min=0) * attn

outputs = self.model(**batch_dict)

This follows the same attention-mask cumsum approach that causal LM generation code uses to derive positions for padded batches, and it produces valid embeddings when I test it with raw OpenVINO inference.
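To make the expression concrete, here is a dependency-free restatement of `(attn.cumsum(-1) - 1).clamp(min=0) * attn` on plain lists, so the result can be checked by hand. `position_ids_from_mask` is a hypothetical name used only for illustration.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Pure-Python equivalent of (mask.cumsum(-1) - 1).clamp(min=0) * mask."""
    rows = []
    for mask_row in attention_mask:
        running = accumulate(mask_row)  # cumulative count of real tokens so far
        rows.append([max(count - 1, 0) * m for count, m in zip(running, mask_row)])
    return rows


right_padded = [[1, 1, 1, 0, 0]]
left_padded = [[0, 0, 1, 1, 1]]

print(position_ids_from_mask(right_padded))  # [[0, 1, 2, 0, 0]]
print(position_ids_from_mask(left_padded))   # [[0, 0, 0, 1, 2]]
```

Real tokens get positions 0, 1, 2, ... regardless of padding side, and padded slots are zeroed out. That matters for left-padded batches, where a plain `arange` would give the first real token a non-zero position.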

A cleaner long-term fix would be to detect decoder-style embeddings and use OVModelForCausalLM + output_hidden_states=True instead, but the 4-line patch above is low-risk and fixes it immediately.
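If the decoder-style route is taken, the pooling step has to pick the hidden state of the last real token in each sequence. Here is a rough sketch of that selection on nested lists, independent of any framework; the function names are hypothetical, and I have not checked how the current embedding path pools.

```python
def last_token_index(mask_row):
    """Index of the last real (mask == 1) token in one sequence."""
    for i in range(len(mask_row) - 1, -1, -1):
        if mask_row[i] == 1:
            return i
    raise ValueError("attention mask has no real tokens")


def last_token_pool(hidden_states, attention_mask):
    """Pick each sequence's final real-token vector.

    hidden_states: nested lists shaped [batch][seq_len][hidden]
    attention_mask: nested lists shaped [batch][seq_len] with 0/1 entries
    """
    return [
        seq[last_token_index(mask)]
        for seq, mask in zip(hidden_states, attention_mask)
    ]
```

Scanning the mask from the right works for both left- and right-padded batches, so the pooling does not depend on the tokenizer's padding side.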

I confirmed this is not a model conversion issue

  • The exported openvino_model.xml / .bin are fine.
  • Same IR works perfectly when I manually supply position_ids via raw openvino.Core.
  • The reranker has the exact same 3-input signature and works today.

Small secondary bug

On GPU, the real error gets logged in openarc.log, but the client request just hangs until timeout. The FastAPI exception handler doesn't seem to propagate it properly for the embeddings endpoint. Not critical, but worth fixing later.

--

Hope someone can take a look. Happy to provide more logs or test any patch.

Thanks!

P
