PyPTO Serving is a small local inference stack for running Qwen3-14B generation with PyPTO kernels on Ascend NPUs. It includes a reusable Python runtime, Qwen3-14B executor glue, CLI entry points, and tests for batching and config handling.
    python/
        cli/                  pypto-serving CLI implementation
        core/                 engine, scheduler, KV cache, model loading, async serving
        runtime/              Simpler worker wrapper for NPU dispatch
    pypto-lib/                submodule providing Qwen3-14B PyPTO kernels
    examples/
        pypto-serving         executable CLI wrapper
        model/qwen3_14b/
            cpu_generate.py   CPU reference generation example
            npu_generate.py   NPU generation/profiling example
            npu_serving.json  sample serving config
            runner/           Qwen3 executors and runner glue
            src/              PyPTO kernel/program builders
    tests/                    CLI, batching, E2E serving, and benchmark tests
Initialize the kernel submodule after cloning:
git submodule update --init --recursive
Run the unit tests:
python -m pytest tests/test_cli.py tests/test_batching.py
Show CLI help:
./examples/pypto-serving --help
python -m python.cli --help
One-shot generation, non-L3 path:
python examples/model/qwen3_14b/npu_generate.py \
--model-dir /data/linyifan/models/Qwen3-14B \
--prompt 'Huawei is' \
--platform a2a3 \
--max-seq-len 512 \
--max-new-tokens 5
One-shot generation, L3 path:
python examples/model/qwen3_14b/npu_generate.py \
--model-dir /data/linyifan/models/Qwen3-14B \
--prompt 'Huawei is' \
--platform a2a3 \
--max-seq-len 512 \
--max-new-tokens 5 \
--l3
Interactive generation:
./examples/pypto-serving \
--config examples/model/qwen3_14b/npu_serving.json \
--device 0 \
--interactive
At the `[user]` prompt, enter a prompt such as `Huawei is`; use `/exit` or `/quit` to leave the interactive session.
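The interactive command above passes both `--config` and `--device 0`, so the flag and the JSON file can disagree about which NPU to use. The following is a minimal sketch of how such an override could be resolved, not the repository's actual implementation. The helper name `apply_device_override` and the sample config contents are hypothetical; only the `npu.device_id` key and the `--device` flag are taken from this README.

```python
import copy
from typing import Optional


def apply_device_override(config: dict, device: Optional[int]) -> dict:
    """Return a copy of config whose npu.device_id is replaced when device is given.

    Hypothetical helper: the real CLI may resolve the override differently.
    """
    merged = copy.deepcopy(config)  # leave the loaded config untouched
    if device is not None:
        # Create the "npu" section if the file did not define one.
        merged.setdefault("npu", {})["device_id"] = device
    return merged


# Hypothetical config contents; only the npu.device_id key comes from the README.
file_config = {"npu": {"device_id": 3}}

with_flag = apply_device_override(file_config, 0)        # as if --device 0 was passed
without_flag = apply_device_override(file_config, None)  # as if the flag was omitted

print(with_flag["npu"]["device_id"])
print(without_flag["npu"]["device_id"])
```

Working on a deep copy keeps the values read from the file available for logging or comparison after the override is applied.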
Start the HTTP server with a multiprocess worker:
python -m python.cli.main \
--config examples/model/qwen3_14b/npu_serving.json \
--serve --port 8899 --device {}
Test with curl:
# Health check
curl http://localhost:8899/health
# Completion
curl http://localhost:8899/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "Huawei is", "max_tokens": 32, "temperature": 0.0}'
# Streaming
curl http://localhost:8899/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "Huawei is", "max_tokens": 32, "stream": true}'
# Chat completion
curl http://localhost:8899/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is 1+1?"}], "max_tokens": 32}'
Run the serving benchmark:
python tests/bench_serving.py --port 8899 --stream -n 8 -c 4 --max-tokens 16
- The sample config points at `/data/linyifan/models/Qwen3-14B`; edit `examples/model/qwen3_14b/npu_serving.json` or pass another config if your model path differs.
- `./examples/pypto-serving --device <id>` overrides `npu.device_id` from the JSON config for both one-shot and interactive serving.
- Generated kernel artifacts are written under `build_output/` and are ignored by git.
- This repository expects PyPTO, CANN, torch, safetensors, transformers, and the local Ascend runtime environment to be available in the active Python environment.
- HTTP serving mode additionally requires `fastapi`, `uvicorn`, and `pydantic`. The benchmark script requires `aiohttp`.
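The non-streaming completion request from the curl examples can also be sent from Python using only the standard library, so the client side needs no extra packages. This is a rough sketch, not part of the repository: the helper names `build_completion_request` and `send` are hypothetical, the URL and payload fields are copied from the curl example, and the response is returned as parsed JSON without assuming any particular schema.

```python
import json
import urllib.request

# Host and port copied from the curl examples (server started with --port 8899).
BASE_URL = "http://localhost:8899"


def build_completion_request(prompt, max_tokens=32, temperature=0.0, base_url=BASE_URL):
    """Build a POST request for /v1/completions with the fields used in the curl example."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the decoded JSON body (schema not assumed here)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_completion_request("Huawei is"))` issues the same request as the completion curl command. Streaming responses would need incremental reading and are not covered by this sketch.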