
[Bug] DLight GEMV schedule rule crashes with KeyError on auto-detected CUDA target #19419

@wuyii8941

Description

The DLight GEMV schedule rule crashes with KeyError: 'key is not in Map' when applied to any matrix-vector multiplication on an auto-detected CUDA target. The root cause is that gemv.py accesses target.attrs["max_shared_memory_per_block"] unconditionally, but the auto-detected CUDA target does not include this attribute.

Reproduction

import tvm
from tvm import relax
import tvm.relax.op as rop
import tvm.dlight

bb = relax.BlockBuilder()
a = relax.Var("a", relax.TensorStructInfo((128, 256), "float32"))
b = relax.Var("b", relax.TensorStructInfo((256, 1), "float32"))

with bb.function("main", [a, b]):
    with bb.dataflow():
        out = bb.emit(rop.matmul(a, b))
        gv = bb.emit_output(out)
    bb.emit_func_output(gv)

mod = bb.finalize()

pipeline = tvm.ir.transform.Sequential([
    relax.transform.LegalizeOps(),
    tvm.dlight.ApplyDefaultSchedule(
        tvm.dlight.gpu.GEMV(),
        tvm.dlight.gpu.Fallback(),
    ),
])

with tvm.target.Target("cuda"):
    mod = pipeline(mod)  # KeyError here

Error

File "tvm/dlight/gpu/gemv.py", line 161, in apply
    and shared_mem_usage.value <= target.max_shared_memory_per_block
File "tvm/target/target.py", line 217, in max_shared_memory_per_block
    return int(self.attrs["max_shared_memory_per_block"])
KeyError: 'key is not in Map'

Root cause

Target("cuda") auto-detects the GPU and produces:

cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32

The attrs map contains only thread_warp_size, max_num_threads, and arch. The key max_shared_memory_per_block is not populated during auto-detection.

gemv.py:161 accesses this missing key without a guard:

and shared_mem_usage.value <= int(target.attrs["max_shared_memory_per_block"])

Note: the same pattern exists on the main branch at python/tvm/s_tir/dlight/gpu/gemv.py:162.
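The failure mode can be reproduced without a GPU by mirroring the auto-detected attrs map above (a plain Python dict stands in for TVM's attrs Map here, which raises the `'key is not in Map'` KeyError in the same situation):

```python
# Simulate the auto-detected CUDA target attrs reported above; the key is absent.
# (A plain dict stands in for tvm's attrs Map, which raises KeyError similarly.)
attrs = {"arch": "sm_75", "max_num_threads": 1024, "thread_warp_size": 32}

try:
    # Same unguarded access pattern as gemv.py:161.
    smem = int(attrs["max_shared_memory_per_block"])
except KeyError as err:
    print(f"KeyError: {err}")
```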

Expected behavior

The GEMV schedule rule should either:

  1. Have the CUDA target auto-detection populate max_shared_memory_per_block (queryable via CUDA API), or
  2. Guard the access with a fallback default, e.g.:
    max_smem = target.attrs.get("max_shared_memory_per_block", 49152)
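Option 2 can be sketched in pure Python (a dict stands in for target.attrs; 49152 bytes = 48 KiB is the conventional CUDA per-block shared-memory default, used here as an assumed fallback, not a value taken from the TVM sources):

```python
# 48 KiB: conventional CUDA per-block shared-memory limit (assumed fallback).
DEFAULT_MAX_SMEM = 49152

def max_smem_per_block(attrs):
    """Guarded lookup: fall back to the default when auto-detection omits the key."""
    return int(attrs.get("max_shared_memory_per_block", DEFAULT_MAX_SMEM))

# The auto-detected attrs from this report lack the key, so the fallback applies:
print(max_smem_per_block({"arch": "sm_75", "max_num_threads": 1024}))  # 49152
# An explicitly specified target keeps its own value:
print(max_smem_per_block({"max_shared_memory_per_block": 65536}))      # 65536
```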

Environment

  • TVM: v0.23.0 (also reproduced on main branch source)
  • GPU: Tesla T4 (sm_75)
  • OS: Ubuntu Linux
  • Python: 3.11

Labels: needs-triage, type: bug
