kernel 'aten::bucketize.Tensor_out' not found.

### 🐛 Describe the bug

I'm moving some computations that use the `bucketize` operator from Kotlin to ExecuTorch.
The model exports (but only if I set the `compile_config`), and the app then crashes when running the model.
It looks like the `bucketize` kernel is missing, so I have already implemented the Tensor and Scalar portable kernels and can open a PR.
To reproduce:
- Export the model
```python
import torch
from torch.export import export
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower, EdgeCompileConfig
from torch import nn


class Bucketizator(nn.Module):
    def forward(self, x, boundaries):
        return torch.bucketize(x, boundaries)


inputs = (torch.randn(2, 2, 2), torch.randn(5))
model = Bucketizator()
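exported_program = export(model, inputs)
# The export only succeeds for me when bucketize.Tensor is added to the
# core ATen ops exception list via compile_config.
eprogram = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
    compile_config=EdgeCompileConfig(
        _core_aten_ops_exception_list=[torch.ops.aten.bucketize.Tensor]
    ),
).to_executorch()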

with open("model.pte", "wb") as file:
    file.write(eprogram.buffer)
```
- Run on Android
```kotlin
val module = Module.load(modelPath)

val x = Tensor.fromBlob(floatArrayOf(1.5f, 2.5f, 1.5f, 1.5f, 7f, 4.5f, 3.5f, 6f), longArrayOf(2, 2, 2))
val boundaries = Tensor.fromBlob(floatArrayOf(1f, 2f, 3f, 4f, 5f), longArrayOf(5))

val xValue1 = EValue.from(x)
val boundariesValue = EValue.from(boundaries)

val result = module.forward(xValue1, boundariesValue)[0].toTensor().dataAsLongArray
```
- Logcat
```
kernel 'aten::bucketize.Tensor_out' not found.
dtype: 6 | dim order: [
0,
1,
2,
]
dtype: 6 | dim order: [
0,
]
dtype: 4 | dim order: [
0,
1,
2,
]
dtype: 4 | dim order: [
0,
1,
2,
]
Missing operator: [0] aten::bucketize.Tensor_out
There are 1 instructions don't have corresponding operator registered. See logs for details
ptr

```
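
For reference, the values the Android snippet above should return once a kernel is registered can be worked out without ExecuTorch. The sketch below is plain Python, not the proposed portable kernel: `bucketize_reference` is a hypothetical helper that mimics the documented semantics of `torch.bucketize` with the standard-library `bisect` module, assuming sorted 1-D boundaries and a flattened input.

```python
from bisect import bisect_left, bisect_right


def bucketize_reference(values, boundaries, right=False):
    """Hypothetical pure-Python stand-in for torch.bucketize on flat lists.

    Assumes `boundaries` is sorted ascending. With right=False the result i
    satisfies boundaries[i-1] < v <= boundaries[i]; with right=True it
    satisfies boundaries[i-1] <= v < boundaries[i].
    """
    search = bisect_right if right else bisect_left
    return [search(boundaries, v) for v in values]


# Same data as the Kotlin snippet (x flattened from shape [2, 2, 2]).
x = [1.5, 2.5, 1.5, 1.5, 7.0, 4.5, 3.5, 6.0]
boundaries = [1.0, 2.0, 3.0, 4.0, 5.0]

expected = bucketize_reference(x, boundaries)
print(expected)  # [1, 2, 1, 1, 5, 4, 3, 5]
```

Reshaped to `[2, 2, 2]`, this is what `dataAsLongArray` would be expected to hold, given that the output tensors in the log have dtype 4 (int64).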

### Versions

```
Collecting environment information...
/home/username/miniforge3/envs/executorch/lib/python3.10/site-packages/torch/cuda/__init__.py:384: UserWarning: Found GPU0 NVIDIA GPU which is of compute capability (CC) 6.1.
The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
- 7.5 which supports hardware CC >=7.5,<8.0
- 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
- 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
- 9.0 which supports hardware CC >=9.0,<10.0
- 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
- 12.0 which supports hardware CC >=12.0,<13.0
Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6
  _warn_unsupported_code(d, device_cc, code_ccs)
/home/username/miniforge3/envs/executorch/lib/python3.10/site-packages/torch/cuda/__init__.py:502: UserWarning: 
NVIDIA GPU with CUDA capability sm_61 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_90 sm_100 sm_120.
If you want to use the NVIDIA GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  queued_call()
PyTorch version: 2.12.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 26.04 LTS (x86_64)
GCC version: (Ubuntu 15.2.0-16ubuntu1) 15.2.0
Clang version: 21.1.8 (6ubuntu1)
CMake version: version 4.3.3
Libc version: glibc-2.43

Python version: 3.10.20 | packaged by conda-forge | (main, Mar  5 2026, 16:42:22) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-7.0.0-22-generic-x86_64-with-glibc2.43
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: 
GPU models and configuration: GPU 0: NVIDIA GPU
Nvidia driver version: 580.159.03
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           43 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               AuthenticAMD
Model name:                              AMD Processor
CPU family:                              23
Model:                                   8
Thread(s) per core:                      2
Core(s) per socket:                      8
Socket(s):                               1
Stepping:                                2
Frequency boost:                         enabled
CPU(s) scaling MHz:                      72%
CPU max MHz:                             3700,0000
CPU min MHz:                             2200,0000
BogoMIPS:                                7398,71
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization:                          AMD-V
L1d cache:                               256 KiB (8 instances)
L1i cache:                               512 KiB (8 instances)
L2 cache:                                4 MiB (8 instances)
L3 cache:                                16 MiB (2 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

Versions of relevant libraries:
[pip3] executorch==1.3.1
[pip3] flake8==6.1.0
[pip3] flake8-breakpoint==1.1.0
[pip3] flake8-bugbear==24.4.26
[pip3] flake8-comprehensions==3.14.0
[pip3] flake8-plugin-utils==1.3.3
[pip3] flake8-pyi==23.5.0
[pip3] mypy==1.14.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] pytorch_tokenizers==1.3.0
[pip3] torch==2.12.0
[pip3] torchao==0.17.0+git02105d46c
[pip3] torchaudio==2.11.0+cpu
[pip3] torchdata==0.11.0+cpu
[pip3] torchsr==1.0.4
[pip3] torchtune==0.0.0
[pip3] torchvision==0.27.0+cpu
[pip3] triton==3.7.0
[conda] executorch                1.3.1                    pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas             13.1.1.3                 pypi_0    pypi
[conda] nvidia-cuda-cupti         13.0.85                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc         13.0.88                  pypi_0    pypi
[conda] nvidia-cuda-runtime       13.0.96                  pypi_0    pypi
[conda] nvidia-cudnn-cu13         9.20.0.48                pypi_0    pypi
[conda] nvidia-cufft              12.0.0.61                pypi_0    pypi
[conda] nvidia-curand             10.4.0.35                pypi_0    pypi
[conda] nvidia-cusolver           12.0.4.66                pypi_0    pypi
[conda] nvidia-cusparse           12.6.3.3                 pypi_0    pypi
[conda] nvidia-cusparselt-cu13    0.8.1                    pypi_0    pypi
[conda] nvidia-nccl-cu13          2.29.7                   pypi_0    pypi
[conda] nvidia-nvjitlink          13.0.88                  pypi_0    pypi
[conda] nvidia-nvtx               13.0.85                  pypi_0    pypi
[conda] pytorch-tokenizers        1.3.0                    pypi_0    pypi
[conda] torch                     2.12.0                   pypi_0    pypi
[conda] torchao                   0.17.0+git02105d46c          pypi_0    pypi
[conda] torchaudio                2.11.0+cpu               pypi_0    pypi
[conda] torchdata                 0.11.0+cpu               pypi_0    pypi
[conda] torchfix                  0.6.0                    pypi_0    pypi
[conda] torchsr                   1.0.4                    pypi_0    pypi
[conda] torchtune                 0.0.0                    pypi_0    pypi
[conda] torchvision               0.27.0+cpu               pypi_0    pypi
[conda] triton                    3.7.0                    pypi_0    pypi
```

kernel 'aten::bucketize.Tensor_out' not found. #20270
