# Startup log
(FunASR) ➜ FunASR git:(main) ✗ CUDA_VISIBLE_DEVICES=7 python examples/industrial_data_pretraining/fun_asr_nano/serve_realtime_ws.py
Downloading Model from https://www.modelscope.cn to directory: /home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512
2026-06-16 00:36:45,686 [INFO] vLLM model already prepared at /home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm
2026-06-16 00:36:47,332 [INFO] Loading audio component weights from /home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/model.pt
2026-06-16 00:36:49,279 [INFO] Loaded audio_encoder: 914 params
2026-06-16 00:36:49,285 [INFO] Loaded audio_adaptor: 36 params
2026-06-16 00:36:49,930 [INFO] Initializing vLLM with model: /home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm
2026-06-16 00:36:49,931 [INFO] tensor_parallel_size=1
2026-06-16 00:36:49,931 [INFO] gpu_memory_utilization=0.8
INFO 06-16 00:36:49 [utils.py:233] non-default args: {'enable_prompt_embeds': True, 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.8, 'disable_log_stats': True, 'model': '/home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm'}
INFO 06-16 00:36:49 [model.py:549] Resolved architecture: Qwen3ForCausalLM
WARNING 06-16 00:36:49 [model.py:2016] Casting torch.bfloat16 to torch.float16.
INFO 06-16 00:36:49 [model.py:1678] Using max model len 4096
INFO 06-16 00:36:49 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-16 00:36:49 [vllm.py:790] Asynchronous scheduling is enabled.
WARNING 06-16 00:36:51 [system_utils.py:152] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore pid=32072) INFO 06-16 00:37:00 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='/home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm', speculative_config=None, tokenizer='/home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 
'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=32072) INFO 06-16 00:37:01 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.16.101.18:39793 backend=nccl
(EngineCore pid=32072) INFO 06-16 00:37:01 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=32072) INFO 06-16 00:37:02 [gpu_model_runner.py:4735] Starting to load model /home/xty/.cache/modelscope/hub/models/FunAudioLLM/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm...
(EngineCore pid=32072) ERROR 06-16 00:37:02 [fa_utils.py:145] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore pid=32072) INFO 06-16 00:37:02 [cuda.py:334] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN', 'FLEX_ATTENTION'].
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.33it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.33it/s]
(EngineCore pid=32072)
(EngineCore pid=32072) INFO 06-16 00:37:03 [default_loader.py:384] Loading weights took 0.44 seconds
(EngineCore pid=32072) INFO 06-16 00:37:04 [gpu_model_runner.py:4820] Model loading took 1.12 GiB memory and 0.992266 seconds
(EngineCore pid=32072) INFO 06-16 00:37:07 [backends.py:1051] Using cache directory: /home/xty/.cache/vllm/torch_compile_cache/fe9c261fb7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=32072) INFO 06-16 00:37:07 [backends.py:1111] Dynamo bytecode transform time: 3.07 s
(EngineCore pid=32072) INFO 06-16 00:37:09 [backends.py:285] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 0.973 s
(EngineCore pid=32072) INFO 06-16 00:37:09 [decorators.py:305] Directly load AOT compilation from path /home/xty/.cache/vllm/torch_compile_cache/torch_aot_compile/f4c695bdd811bfcaa114b11fad47b26ba0df24f513a51e5635e73f77d4ca426e/rank_0_0/model
(EngineCore pid=32072) INFO 06-16 00:37:09 [monitor.py:48] torch.compile took 4.72 s in total
(EngineCore pid=32072) INFO 06-16 00:37:09 [monitor.py:76] Initial profiling/warmup run took 0.08 s
(EngineCore pid=32072) INFO 06-16 00:37:09 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=32072) INFO 06-16 00:37:09 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=32072) INFO 06-16 00:37:11 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 0.47 GiB total
(EngineCore pid=32072) INFO 06-16 00:37:11 [gpu_worker.py:436] Available KV cache memory: 23.55 GiB
(EngineCore pid=32072) INFO 06-16 00:37:11 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8000 to 0.8150 to maintain the same effective KV cache size.
(EngineCore pid=32072) INFO 06-16 00:37:11 [kv_cache_utils.py:1319] GPU KV cache size: 220,448 tokens
(EngineCore pid=32072) INFO 06-16 00:37:11 [kv_cache_utils.py:1324] Maximum concurrency for 4,096 tokens per request: 53.82x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:01<00:00, 25.76it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 27.55it/s]
(EngineCore pid=32072) INFO 06-16 00:37:15 [gpu_model_runner.py:6046] Graph capturing finished in 4 secs, took 0.42 GiB
(EngineCore pid=32072) INFO 06-16 00:37:15 [gpu_worker.py:597] CUDA graph pool memory: 0.42 GiB (actual), 0.47 GiB (estimated), difference: 0.06 GiB (13.6%).
(EngineCore pid=32072) INFO 06-16 00:37:15 [core.py:283] init engine (profile, create kv cache, warmup model) took 11.46 seconds
(EngineCore pid=32072) INFO 06-16 00:37:16 [vllm.py:790] Asynchronous scheduling is enabled.
2026-06-16 00:37:17,681 [INFO] Loaded embedding layer: torch.Size([151936, 1024])
2026-06-16 00:37:17,956 [INFO] Audio encoding: 1 samples in 0.255s
2026-06-16 00:37:19,927 [INFO] vLLM generation: 1.971s
[{'key': 'long', 'text': ''}]
[long]
(EngineCore pid=32072) INFO 06-16 00:37:20 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=32072) INFO 06-16 00:37:20 [core.py:1233] Shutdown complete
[rank0]:[W616 00:37:20.703451303 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
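
Most of the log above is INFO output, so the few ERROR and WARNING lines are easy to miss. The following is a minimal sketch for pulling them out of a saved copy of the log. `non_info_lines` and `LEVEL_RE` are hypothetical names introduced here, and the regular expression only assumes the vLLM line layout seen in this particular log (`LEVEL MM-DD HH:MM:SS [file:line] message`, optionally prefixed by `(EngineCore pid=...)`). It does not cover the trailing `[rank0]:[W616 ...` line, which PyTorch prints in a different format.

```python
import re

# Assumed layout, taken from the log above:
# "(EngineCore pid=32072) ERROR 06-16 00:37:02 [fa_utils.py:145] message"
LEVEL_RE = re.compile(
    r"^(?:\(EngineCore pid=\d+\)\s+)?"      # optional engine-process prefix
    r"(ERROR|WARNING)\s+"                   # log level
    r"\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+"   # date and time
    r"\[([^\]]+)\]\s+"                      # source file and line number
    r"(.*)$"                                # message
)


def non_info_lines(log_text):
    """Return (level, source, message) for every ERROR or WARNING line."""
    found = []
    for line in log_text.splitlines():
        match = LEVEL_RE.match(line.strip())
        if match:
            found.append(match.groups())
    return found


# Four lines copied from the startup log above.
sample = """\
INFO 06-16 00:36:49 [model.py:549] Resolved architecture: Qwen3ForCausalLM
WARNING 06-16 00:36:49 [model.py:2016] Casting torch.bfloat16 to torch.float16.
(EngineCore pid=32072) ERROR 06-16 00:37:02 [fa_utils.py:145] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore pid=32072) INFO 06-16 00:37:02 [cuda.py:334] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN', 'FLEX_ATTENTION'].
"""

hits = non_info_lines(sample)
for level, source, message in hits:
    print(level, source, message)
```

On this log the filter surfaces the bfloat16-to-float16 cast and the FlashAttention 2 fallback. Whether either one is related to the empty transcript is not established by the log itself.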
# Question
Running the offline SDK inference example code from the official FunASR website does not produce any speech recognition result.
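
The symptom shows up in the startup log as the printed result `[{'key': 'long', 'text': ''}]`: the run finishes without an exception, but the `text` field is empty. A minimal sketch for flagging this case programmatically is shown below. `empty_transcripts` is a hypothetical helper, not part of FunASR, and it only assumes the list-of-dicts result shape printed in the log.

```python
def empty_transcripts(results):
    """Return the keys of result entries whose 'text' is missing or blank."""
    return [
        item.get("key")
        for item in results
        if not (item.get("text") or "").strip()
    ]


# The exact result printed at the end of the startup log.
results = [{"key": "long", "text": ""}]
print(empty_transcripts(results))
```

A check like this makes a silent failure explicit, for example by raising or logging when the returned list is non-empty.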
# Environment
Device (cuda, cpu, mps): cuda
Python 3.12.13
vllm 0.19.1
torch 2.10.0+cu126
torch-c-dlpack-ext 0.1.5
torch-complex 0.4.4
torchaudio 2.10.0+cu126
torchvision 0.25.0+cu126
cuda-bindings 12.9.4
cuda-core 1.0.1
cuda-pathfinder 1.5.5
cuda-python 12.9.4
cuda-tile 1.3.0
cuda-toolkit 12.6.3
nvidia-cuda-cccl 13.3.3.3.1
nvidia-cuda-crt 13.3.33
nvidia-cuda-cupti 13.0.85
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvcc 13.2.78
nvidia-cuda-nvrtc 13.3.33
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime 13.3.29
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cuda-tileiras 13.2.78
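
The listing mixes `nvidia-cuda-*` packages versioned 12.6.x (the `-cu12` variants that match `torch 2.10.0+cu126`) with others versioned 13.x. Whether this has anything to do with the empty transcript is unknown. The sketch below only makes the split visible by grouping the packages by the major number of their version. `cuda_majors` is a hypothetical helper, and the input is assumed to be plain `name version` lines like the ones above.

```python
from collections import defaultdict


def cuda_majors(listing):
    """Group nvidia-cuda-* packages by the major number of their version."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        # Skip anything that is not a "name version" pair for an
        # nvidia-cuda-* package (for example torch or cuda-core).
        if len(parts) != 2 or not parts[0].startswith("nvidia-cuda-"):
            continue
        name, version = parts
        groups[int(version.split(".")[0])].append(name)
    return dict(groups)


# A subset of the environment listing above.
listing = """\
cuda-toolkit 12.6.3
nvidia-cuda-cupti 13.0.85
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc 13.3.33
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime 13.3.29
nvidia-cuda-runtime-cu12 12.6.77
"""

groups = cuda_majors(listing)
for major, names in sorted(groups.items()):
    print(major, names)
```

The prefix filter is deliberate: packages such as `cuda-core 1.0.1` or `cuda-pathfinder 1.5.5` follow their own version scheme, so their major number says nothing about the CUDA release.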