
[Bug] DLight GEMV schedule rule crashes with KeyError on auto-detected CUDA target #19419

@wuyii8941

Description

The DLight GEMV schedule rule crashes with KeyError: 'key is not in Map' when applied to any matrix-vector multiplication on an auto-detected CUDA target. The root cause is that gemv.py accesses target.attrs["max_shared_memory_per_block"] unconditionally, but the auto-detected CUDA target does not include this attribute.

Reproduction

import tvm
from tvm import relax
import tvm.relax.op as rop
import tvm.dlight

bb = relax.BlockBuilder()
a = relax.Var("a", relax.TensorStructInfo((128, 256), "float32"))
b = relax.Var("b", relax.TensorStructInfo((256, 1), "float32"))

with bb.function("main", [a, b]):
    with bb.dataflow():
        out = bb.emit(rop.matmul(a, b))
        gv = bb.emit_output(out)
    bb.emit_func_output(gv)

mod = bb.finalize()

pipeline = tvm.ir.transform.Sequential([
    relax.transform.LegalizeOps(),
    tvm.dlight.ApplyDefaultSchedule(
        tvm.dlight.gpu.GEMV(),
        tvm.dlight.gpu.Fallback(),
    ),
])

with tvm.target.Target("cuda"):
    mod = pipeline(mod)  # KeyError here

Error

File "tvm/dlight/gpu/gemv.py", line 161, in apply
    and shared_mem_usage.value <= target.max_shared_memory_per_block
File "tvm/target/target.py", line 217, in max_shared_memory_per_block
    return int(self.attrs["max_shared_memory_per_block"])
KeyError: 'key is not in Map'

Root cause

Target("cuda") auto-detects the GPU and produces:

cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32

The attrs map contains only thread_warp_size, max_num_threads, and arch. The key max_shared_memory_per_block is not populated during auto-detection.

gemv.py:161 accesses this missing key without a guard:

and shared_mem_usage.value <= int(target.attrs["max_shared_memory_per_block"])

Note: the same pattern exists on the main branch at python/tvm/s_tir/dlight/gpu/gemv.py:162.
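The failure mode can be reproduced without a GPU by mirroring the auto-detected attrs map above (a plain Python dict stands in for TVM's attrs Map here, which raises the `'key is not in Map'` KeyError in the same situation):

```python
# Simulate the auto-detected CUDA target attrs reported above; the key is absent.
# (A plain dict stands in for tvm's attrs Map, which raises KeyError similarly.)
attrs = {"arch": "sm_75", "max_num_threads": 1024, "thread_warp_size": 32}

try:
    # Same unguarded access pattern as gemv.py:161.
    smem = int(attrs["max_shared_memory_per_block"])
except KeyError as err:
    print(f"KeyError: {err}")
```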

Expected behavior

The GEMV schedule rule should either:

  1. Have the CUDA target auto-detection populate max_shared_memory_per_block (queryable via CUDA API), or
  2. Guard the access with a fallback default, e.g.:
    max_smem = target.attrs.get("max_shared_memory_per_block", 49152)
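Option 2 can be sketched in pure Python (a dict stands in for target.attrs; 49152 bytes = 48 KiB is the conventional CUDA per-block shared-memory default, used here as an assumed fallback, not a value taken from the TVM sources):

```python
# 48 KiB: conventional CUDA per-block shared-memory limit (assumed fallback).
DEFAULT_MAX_SMEM = 49152

def max_smem_per_block(attrs):
    """Guarded lookup: fall back to the default when auto-detection omits the key."""
    return int(attrs.get("max_shared_memory_per_block", DEFAULT_MAX_SMEM))

# The auto-detected attrs from this report lack the key, so the fallback applies:
print(max_smem_per_block({"arch": "sm_75", "max_num_threads": 1024}))  # 49152
# An explicitly specified target keeps its own value:
print(max_smem_per_block({"max_shared_memory_per_block": 65536}))      # 65536
```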

Environment

  • TVM: v0.23.0 (also reproduced on main branch source)
  • GPU: Tesla T4 (sm_75)
  • OS: Ubuntu Linux
  • Python: 3.11

Labels: needs-triage, type: bug
