Releases: modelscope/FunASR
v1.3.10
FunASR v1.3.10
New features
- Agent-friendly CLI: `funasr audio.wav --output-format json` for structured output
- Fun-ASR-Nano: batched VAD-segment decoding (~1.75× faster) (#2979)
- WebSocket 2-pass server: sentence-level timestamps
- serve_vllm.py: new `--vad-model`/`--spk-model` flags
Fixes
- Fun-ASR-Nano: bf16/fp16 inference no longer crashes; warn on degraded fp16 (#2980)
- Fun-ASR-Nano vLLM: fix CUDA crash from `repetition_penalty`
- CLI: valid SRT timestamps + correct JSON durations (#2982); use `sentence_info` text (#2983); correct model id `Fun-ASR-Nano-2512` (#2984)
- Clearer error for missing audio path (#2981); respect explicit VAD silence threshold; handle `None` encoder/scheduler configs
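For readers unfamiliar with what a valid SRT timestamp looks like, here is a minimal sketch of the `HH:MM:SS,mmm` format used by SubRip files. It is not FunASR's implementation; the function names `srt_timestamp` and `srt_cue` are hypothetical and exist only for this illustration.

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    if ms < 0:
        raise ValueError("SRT timestamps cannot be negative")
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    # SRT uses a comma, not a dot, before the milliseconds.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Render one numbered SRT cue (hypothetical helper for this sketch)."""
    return f"{index}\n{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n{text}\n"


print(srt_cue(1, 0, 1500, "hello"))
```

Integer milliseconds are used throughout so that rounding cannot push the millisecond field to 1000.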
Docs
- New CLI reference; clearer vLLM install guidance
Full changelog: v1.3.9...v1.3.10
v1.3.9: Wheel packaging + SenseVoice speaker diarization fix
What's New
Wheel packaging (fixes #2943)
FunASR now publishes a py3-none-any wheel alongside the source distribution. Installation is faster since pip no longer needs to build from source.
Bug fixes
- SenseVoice + speaker diarization: Fixed crash when using `spk_model="cam++"` with SenseVoice (auto-falls back to VAD-segment mode since SenseVoice doesn't produce word-level timestamps)
- torchaudio >= 2.11 compatibility: Added `soundfile` as intermediate fallback for users with newer torchaudio versions that removed legacy backends
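The fallback idea in the torchaudio fix can be sketched as a chain of loaders tried in order. This is an illustration of the general pattern only, not FunASR's actual loading code; `load_with_fallback` and the two `_load_*` helpers are hypothetical names. The backends are imported lazily so that a missing package simply moves on to the next loader.

```python
from typing import Any, Callable, Sequence, Tuple

Loader = Callable[[str], Any]


def load_with_fallback(path: str, loaders: Sequence[Tuple[str, Loader]]):
    """Try each named loader in order; return (name, result) from the first that works."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(path)
        except Exception as exc:  # missing backend, unsupported file, ...
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError(f"no audio backend could read {path!r}: " + "; ".join(errors))


def _load_torchaudio(path: str):
    import torchaudio  # imported lazily; may be missing or lack a usable backend
    return torchaudio.load(path)  # (waveform tensor, sample_rate)


def _load_soundfile(path: str):
    import soundfile  # imported lazily for the same reason
    return soundfile.read(path)  # (numpy array, sample_rate)


# Assumed order for this sketch: torchaudio first, then soundfile.
DEFAULT_LOADERS = [("torchaudio", _load_torchaudio), ("soundfile", _load_soundfile)]
```

Calling `load_with_fallback("audio.wav", DEFAULT_LOADERS)` returns the name of the backend that succeeded together with its result, which makes it easy to log which path was taken.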
Install / Upgrade
pip install --upgrade funasr
Full changelog: v1.3.3...v1.3.9
v1.3.3: Agent Integration — OpenAI API + MCP Server + funasr-server CLI
Highlights
This release makes FunASR a drop-in speech backend for AI agents.
New: funasr-server CLI
pip install funasr fastapi uvicorn python-multipart
funasr-server --device cuda
One command starts an OpenAI-compatible /v1/audio/transcriptions endpoint.
New: MCP Server
AI assistants (Claude, Cursor, Windsurf) can now transcribe audio directly.
New: OpenAI-Compatible API
Works with any agent framework: LangChain, AutoGen, CrewAI, Dify, Flowise, Open WebUI.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
result = client.audio.transcriptions.create(model="sensevoice", file=open("a.wav","rb"))
Bug Fixes
- Fixed `hub="hf"` parameter propagation to sub-models (v1.3.2)
- Fixed Qwen3-ASR ImportError masking
Upgrade
pip install --upgrade funasr
v1.3.2: HuggingFace Hub Fix + Performance Benchmark
What's New
Bug Fix
- Fixed hub parameter propagation: when using `hub="hf"`, the parameter is now correctly forwarded to VAD/PUNC/SPK sub-models. Previously, users on HuggingFace would get 404 errors for sub-models. (#2859)
Improvements
- Updated PyPI metadata with better description, keywords, and project URLs
- Added comprehensive benchmark page: https://modelscope.github.io/FunASR/benchmark.html
Benchmark Results (PyTorch, GPU)
| Model | Type | Speed |
|---|---|---|
| SenseVoice-Small | NAR | 170x realtime |
| Paraformer-Large | NAR | 120x realtime |
| Whisper-large-v3-turbo | AR | 46x realtime |
| Fun-ASR-Nano | LLM | 17x realtime |
| Whisper-large-v3 | AR | 13.4x realtime |
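To make the table concrete, the sketch below converts the speed column into wall-clock time for one hour of audio. It assumes that "Nx realtime" means N seconds of audio are processed per second of compute; `transcription_seconds` is a hypothetical helper, and real throughput will vary with hardware, batch size, and audio content.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated processing time, assuming 'Nx realtime' = N audio seconds per second."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Speeds copied from the benchmark table above.
SPEEDS = {
    "SenseVoice-Small": 170,
    "Paraformer-Large": 120,
    "Whisper-large-v3-turbo": 46,
    "Fun-ASR-Nano": 17,
    "Whisper-large-v3": 13.4,
}

for model, speed in SPEEDS.items():
    seconds = transcription_seconds(3600, speed)
    print(f"{model}: about {seconds:.1f} s per hour of audio")
```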
Install / Upgrade
pip install --upgrade funasr
Quick Start
from funasr import AutoModel
model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", hub="hf", vad_model="funasr/fsmn-vad", device="cuda")
result = model.generate(input="audio.wav")
0.3.0
What's new:
2023.3.17, funasr-0.3.0, modelscope-1.4.1
- New Features:
- Added a GPU runtime solution, nv-triton, which makes it easy to export Paraformer models from ModelScope and deploy them as Triton services. In benchmark tests on a single V100 GPU, it achieved an RTF of 0.0032 and a speedup of 300.
- Added a CPU runtime quantization solution, which supports exporting quantized ONNX and Libtorch models from ModelScope. In benchmark tests on a CPU-8369B, quantization cut the RTF roughly in half (0.00438 -> 0.00226) and nearly doubled the speedup (228 -> 442).
- Added a C++ version of the gRPC service deployment solution. Combined with the C++ version of ONNXRuntime and the quantization solution, it is about twice as efficient as the Python runtime (demo).
- Added a streaming inference pipeline to the 16k and 8k VAD models, with support for audio input streams of 10 ms or longer (demo).
- Improved the punctuation prediction model, resulting in increased accuracy (F-score increased from 55.6 to 56.5).
- Added a real-time subtitle example based on the gRPC service, using a 2-pass recognition model: the Paraformer streaming model outputs text in real time, while the Paraformer-large offline model corrects the recognition results (demo).
- New Models:
- Added a 16k Paraformer streaming model, which supports real-time speech recognition with streaming audio input (demo). It can be deployed with the gRPC service to provide real-time subtitles.
- Added a streaming punctuation model, which supports real-time punctuation in streaming speech recognition scenarios, with calls triggered at VAD points. It can be used together with real-time ASR models to produce readable real-time subtitles (demo).
- Added the TP-Aligner timestamp model, which takes audio and the corresponding text as input and outputs word-level timestamps. Its performance is comparable to that of the Kaldi FA model (60.3ms vs. 69.3ms). It can be combined freely with ASR models (demo).
- Added a financial domain model (8k Paraformer-large-3445vocab), fine-tuned on 1000 hours of data. Recognition accuracy on the financial domain test set improved by a relative 5%, and the recall of domain keywords improved by a relative 7%.
- Added an audio-visual domain model (16k Paraformer-large-3445vocab), fine-tuned on 10,000 hours of data. Recognition accuracy on the audio-visual domain test set improved by a relative 8%.
- Added an 8k speaker verification model, which can also be used for speaker embedding extraction.
- Added speaker diarization models, including a 16k SOND Chinese model and an 8k SOND English model, which achieved the best performance on AliMeeting and Callhome with DERs of 4.46% and 11.13%, respectively.
- Added UniASR unified streaming/offline models, including 16k UniASR Burmese, 16k UniASR Hebrew, 16k UniASR Urdu, 8k UniASR Mandarin financial domain, and 16k UniASR Mandarin audio-visual domain.
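The RTF and speedup figures quoted in the feature list are consistent with speedup = 1 / RTF, if RTF is read as processing time divided by audio duration (the usual definition, assumed here). A lower RTF is therefore better. The short check below uses only the numbers from the notes; the helper names are hypothetical.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Speedup over real time, assuming RTF = processing time / audio duration."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a value dropped, relative to its starting point."""
    return (before - after) / before


# CPU quantization figures from the notes: RTF 0.00438 -> 0.00226.
print(round(speedup_from_rtf(0.00438)))            # close to the quoted 228
print(round(speedup_from_rtf(0.00226)))            # close to the quoted 442
print(f"{relative_reduction(0.00438, 0.00226):.1%} lower RTF after quantization")

# GPU figure from the notes: RTF 0.0032, quoted speedup of 300.
print(speedup_from_rtf(0.0032))
```

Under this reading, the quoted GPU speedup of 300 is a rounded value of 1 / 0.0032.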
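Latest updates (Chinese release notes):
- 2023.3.17: funasr-0.3.0, modelscope-1.4.1
- Feature improvements:
- Added a GPU runtime solution, nv-triton, which makes it easy to export Paraformer models from ModelScope and deploy them as Triton services. Measured on a single V100 GPU, the RTF is 0.0032 and the throughput is 300 (benchmark).
- Added a CPU runtime quantization solution that supports exporting quantized ONNX and Libtorch models from ModelScope. Measured on a CPU-8369B, quantization improves the RTF by about 50% (0.00438 -> 0.00226) and doubles the throughput (228 -> 442) (benchmark).
- Added a C++ gRPC service deployment solution. Combined with the C++ version of ONNXRuntime and the quantization solution, it doubles performance compared with the Python runtime.
- Added streaming inference to the ModelScope pipelines of the 16k and 8k VAD models, supporting audio input streams as short as 10 ms (usage).
- Improved the punctuation prediction model; punctuation accuracy is subjectively better (absolute F-score improvement, 55.6 -> 56.5).
- Added a real-time subtitle demo based on the gRPC service, using a 2-pass recognition model: the Paraformer streaming model puts text on screen, and the Paraformer-large offline model corrects the recognition results.
- New models:
- 16k Paraformer streaming model: supports streaming audio input for real-time speech recognition (usage). It can be deployed with the gRPC service to provide real-time subtitles.
- Streaming punctuation model: supports punctuation in streaming speech recognition scenarios, with streaming calls triggered at VAD points. It can be used together with real-time ASR models to produce readable real-time subtitles (usage).
- TP-Aligner timestamp model: takes audio and the corresponding text as input and outputs character-level timestamps. Its performance is comparable to that of the Kaldi FA model (60.3ms vs. 69.3ms), and it can be combined freely with ASR models (usage).
- Financial domain model, 8k Paraformer-large-3445vocab: fine-tuned on 1000 hours of data. Recognition on the financial domain test set improves by a relative 5%, and domain keyword recall improves by a relative 7%.
- Audio-visual domain model, 16k Paraformer-large-3445vocab: fine-tuned on 10,000 hours of data. Recognition on the audio-visual domain test set improves by a relative 8%.
- 8k speaker verification model: an English speaker verification model for the CallHome dataset, which can also be used for voiceprint feature extraction.
- Speaker diarization models: a 16k SOND Chinese model and an 8k SOND English model, which achieve the best performance on AliMeeting and Callhome with DERs of 4.46% and 11.13%, respectively.
- UniASR unified streaming/offline models: 16k UniASR Burmese, 16k UniASR Hebrew, 16k UniASR Urdu, 8k UniASR Chinese financial domain, and 16k UniASR Chinese audio-visual domain.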
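The streaming VAD pipelines described in this release accept audio in chunks as short as 10 ms. As a rough illustration of what that means in samples, the sketch below splits a buffer into 10 ms chunks at the 16k and 8k sample rates named in the notes. It does not call FunASR; `frame_length` and `iter_chunks` are hypothetical helpers, and how the chunks are fed to a model is left out.

```python
from typing import Iterator, Sequence


def frame_length(sample_rate: int, chunk_ms: int = 10) -> int:
    """Number of samples in one chunk of chunk_ms milliseconds."""
    return sample_rate * chunk_ms // 1000


def iter_chunks(samples: Sequence, sample_rate: int, chunk_ms: int = 10) -> Iterator[Sequence]:
    """Yield consecutive chunks; the last one may be shorter than a full chunk."""
    step = frame_length(sample_rate, chunk_ms)
    if step <= 0:
        raise ValueError("chunk is shorter than one sample")
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


print(frame_length(16000))  # samples per 10 ms at 16 kHz
print(frame_length(8000))   # samples per 10 ms at 8 kHz

# 25 ms of dummy 16 kHz audio splits into two full chunks and one partial chunk.
dummy = [0] * 400
print([len(chunk) for chunk in iter_chunks(dummy, 16000)])
```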
New Contributors
- @dingbig made their first contribution in #147
- @yuekaizhang made their first contribution in #161
- @zhuz...
v0.2.0
What's new:
2023.2.17, funasr-0.2.0, modelscope-1.3.0
- We support a new feature: exporting Paraformer models from ModelScope to ONNX and TorchScript. Locally fine-tuned models are also supported.
- We support a new feature, ONNXRuntime: you can deploy the runtime without ModelScope or FunASR. For the Paraformer-large model, ONNXRuntime gives about a 3x speedup on CPU (RTF 0.110 -> 0.038), details.
- We support a new feature, gRPC: you can build an ASR service with gRPC by deploying either the ModelScope pipeline or ONNXRuntime.
- We release a new model, Paraformer-large-contextual, which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of Paraformer-large-long: timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7ms, details.
- We release a new model, the 8k VAD model, which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in ModelScope.
- We release a new model, MFCCA, a multi-channel multi-speaker model that is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models: Southern Fujian dialect, French, German, Vietnamese, and Persian.
- We release a new model, Paraformer-data2vec: a Paraformer model initialized from unsupervised pretraining on AISHELL-2 and then fine-tuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be combined freely, using models from ModelScope or locally fine-tuned models (demo).
- We optimize the general punctuation model: recall and precision are improved, and bad cases with missing punctuation marks are fixed.
- The ModelScope inference pipeline now supports more audio input formats, including mp3, flac, ogg, and opus.
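The 74.7ms figure for Paraformer-large-long is a timestamp shift measured against reference alignments. The exact metric is defined in the linked details; the sketch below shows one plausible reading, stated as an assumption: the mean absolute offset of predicted start and end times from their references, averaged over all token boundaries. `average_boundary_shift` is a hypothetical name.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_ms, end_ms) for one token


def average_boundary_shift(pred: Sequence[Span], ref: Sequence[Span]) -> float:
    """Mean absolute shift (ms) over all start and end boundaries.

    Assumes pred and ref describe the same tokens in the same order.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of tokens")
    if not pred:
        raise ValueError("at least one token is required")
    total = 0.0
    for (pred_start, pred_end), (ref_start, ref_end) in zip(pred, ref):
        total += abs(pred_start - ref_start) + abs(pred_end - ref_end)
    return total / (2 * len(pred))


predicted = [(0.0, 100.0), (100.0, 250.0)]
reference = [(10.0, 90.0), (120.0, 240.0)]
print(average_boundary_shift(predicted, reference))
```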
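Latest updates (Chinese release notes):
- February 2023 (released February 17): funasr-0.2.0, modelscope-1.3.0
- Feature improvements:
- Added model export: all Paraformer models in ModelScope, as well as locally fine-tuned models, can be exported in one step to ONNX and TorchScript formats for deployment.
- Added ONNXRuntime deployment for Paraformer models, with no need to install ModelScope or FunASR. Measured on CPU, ONNXRuntime inference is nearly 3x faster (RTF: 0.110 -> 0.038).
- Added gRPC service support: both the ModelScope inference pipeline and ONNXRuntime can be deployed as services.
- Optimized timestamps for the Paraformer-large long-audio model: timestamp prediction accuracy on bad cases improves substantially, with an average start/end timestamp shift of 74.7ms; see the paper.
- Added free combination of any VAD, ASR, and punctuation models: any models in ModelScope and locally fine-tuned models can be combined for inference (usage example).
- Optimized the general punctuation model: improved punctuation recall and precision and fixed problems such as missing punctuation.
- Added automatic sample-rate adaptation: audio at any input sample rate is automatically matched to the model's sample rate. Added support for more audio formats, such as mp3, flac, ogg, and opus.
- New models:
- Paraformer-large hotword model: supports hotword customization. Given a hotword list, it applies incentive enhancement to the hotwords to improve their recall.
- MFCCA multi-channel multi-speaker recognition model: a paper in collaboration with the Audio, Speech and Language Processing Group at Northwestern Polytechnical University, describing a multi-channel speech recognition model based on a multi-frame cross-channel attention mechanism.
- 8k VAD model: detects the start and end times of valid speech in long audio segments. It supports streaming input, with audio input streams as short as 10 ms.
- UniASR unified streaming/offline models: 16k UniASR Southern Fujian dialect (Minnan), 16k UniASR French, 16k UniASR German, 16k UniASR Vietnamese, and 16k UniASR Persian.
- Paraformer model with unsupervised Data2vec pretraining: uses a Data2vec unsupervised pretrained model as initialization and fine-tunes the Paraformer model on AISHELL-1.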
New Contributors
- @zjc6666 made their first contribution in #35
- @lyblsgo made their first contribution in #37
- @lingyunfly made their first contribution in #42
- @fangd123 made their first contribution in #44
- @dyyzhmm made their first contribution in #48
- @R1ckShi made their first contribution in #50
- @chenmengzheAAA made their first contribution in #57
- @ZhihaoDU made their first contribution in #95
- @SWHL made their first contribution in #97
- @yufan-aslp made their first contribution in #105
- @magicharry made their first contribution in #119
Full Changelog: v0.1.6...v0.2.0
v0.1.6
Release Notes:
2023.1.16, funasr-0.1.6
- We release a new model version, Paraformer-large-long, which integrates the VAD, ASR, punctuation, and timestamp models. It can take inputs several hours long.
- We release a new model type, VAD, which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in the Model Zoo.
- We release a new model type, Punctuation, which predicts punctuation for the output of ASR models. It can be freely integrated with any ASR model in the Model Zoo.
- We release a new model, Data2vec, an unsupervised pretraining model that can be fine-tuned on ASR and other downstream tasks.
- We release a new model, Paraformer-Tiny, a lightweight Paraformer model that supports Mandarin command word recognition.
- We release a new model type, SV, which extracts speaker embeddings and performs speaker verification on paired utterances. Support for speaker diarization will follow in a future version.
- We speed up inference in the ModelScope pipeline by integrating model building into pipeline building.
- The ModelScope inference pipeline now supports more audio input types, including wav.scp, wav format, audio bytes, and wave samples.
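Of the input types listed, `wav.scp` is the least self-explanatory. In the Kaldi convention it is a text file with one utterance per line, an utterance ID followed by a path. The sketch below parses that simple two-column form. It is an illustration under that assumption, not the pipeline's own reader; `parse_wav_scp` is a hypothetical helper, and Kaldi's piped-command entries are not handled.

```python
from typing import Dict


def parse_wav_scp(text: str) -> Dict[str, str]:
    """Map utterance IDs to audio paths from '<utt_id> <path>' lines."""
    entries: Dict[str, str] = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected '<utt_id> <path>'")
        utt_id, path = parts
        entries[utt_id] = path
    return entries


example = "utt1 /data/a.wav\nutt2 /data/b.wav\n"
print(parse_wav_scp(example))
```

Splitting only on the first run of whitespace keeps paths that contain spaces intact.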
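Latest updates (Chinese release notes):
- January 2023 (released January 16): funasr-0.1.6, modelscope-1.2.0
- New models:
- Paraformer-large long-audio model: integrates VAD, ASR, punctuation, and timestamps. It can directly recognize audio several hours long and output punctuated text with timestamps.
- Chinese unsupervised pretrained Data2vec model: uses the Data2vec architecture and is pretrained without supervision on AISHELL-2. It supports fine-tuning for ASR or other downstream tasks.
- 16k VAD model: detects the start and end times of valid speech in long audio segments.
- General Chinese punctuation prediction model: predicts punctuation for the text output by speech recognition models.
- 8K UniASR streaming model and 8K UniASR model: a unified streaming/offline speech recognition model that, while performing streaming recognition, outputs offline recognition results with low latency to correct the predicted text.
- Paraformer-large models fine-tuned on AISHELL-1 and AISHELL-2: the Paraformer-large model fine-tuned separately on the AISHELL-1 and AISHELL-2 data.
- Speaker verification model: can be used for speaker verification and for speaker feature extraction.
- Small on-device Paraformer command word model: the command word version of Paraformer-tiny, which uses a model with few parameters to recognize command words.
- Upgraded the original TensorFlow models to PyTorch for inference, with support for fine-tuning and customization, including:
- 16K models: Paraformer Chinese, Paraformer-large Chinese, UniASR Chinese, UniASR-large Chinese, UniASR Chinese streaming, UniASR dialect, UniASR dialect streaming, UniASR Japanese, UniASR Japanese streaming, UniASR Indonesian, UniASR Indonesian streaming, UniASR Portuguese, UniASR Portuguese streaming, UniASR English, UniASR English streaming, UniASR Russian, UniASR Russian streaming, UniASR Korean, UniASR Korean streaming, UniASR Spanish, UniASR Spanish streaming, UniASR Cantonese (Simplified), UniASR Cantonese (Simplified) streaming
- 8K models: Paraformer Chinese, UniASR Chinese, UniASR Chinese streaming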
New Contributors
- @nichongjia-2007 made their first contribution in #27
Full Changelog: v0.1.4...v0.1.6