Add comprehensive NumPy comparison tests for GPU kernels #186

## Summary

Current test coverage for GPU kernel operations is incomplete. Most operations lack explicit NumPy comparison tests to verify correctness.

## Current Coverage

### ✅ Tested (vs NumPy)
| Category | Operations | Test File |
|----------|-----------|-----------|
| Elementwise | `add`, `mul` | test_ops.py |
| Matmul | `matmul`, `matmul_tiled` | test_ops.py |
| TF32 | `matmul` (TF32 mode) | test_tf32_api.py |

### ❌ Missing NumPy Tests

| Category | Operations |
|----------|-----------|
| **Elementwise** | `sub`, `div`, `add_inplace`, `mul_inplace`, `copy_to` |
| **Unary** | `exp`, `log`, `relu` |
| **Reduction** | `sum`, `mean`, `max`, `softmax` |
| **NN** | `gelu`, `silu`, `layernorm`, `rmsnorm` |
| **Matmul** | `batched_matmul`, `transpose`, `linear_bias_gelu` |
| **FP8/NVF4** | `matmul_fp8*`, `gemv_*`, `quantize_*` (SM-dependent) |
| **Tensor** | `concat_axis0`, `repeat_interleave_axis1`, `transpose_3d_021`, `cast_*` |

## Proposed Test Structure

```python
import numpy as np
import scipy.special

# `gp` is the GPU library under test; import it the same way test_ops.py does.


class TestSubOperation:
    def test_sub_basic(self):
        a_np = np.random.rand(1024).astype(np.float32)
        b_np = np.random.rand(1024).astype(np.float32)
        a, b = gp.from_numpy(a_np), gp.from_numpy(b_np)
        result = gp.sub(a, b).to_numpy()
        np.testing.assert_array_almost_equal(result, a_np - b_np)


class TestExpOperation:
    def test_exp_basic(self):
        x_np = np.random.rand(1024).astype(np.float32)
        x = gp.from_numpy(x_np)
        result = gp.exp(x).to_numpy()
        np.testing.assert_array_almost_equal(result, np.exp(x_np), decimal=5)


class TestSoftmaxOperation:
    def test_softmax_1d(self):
        x_np = np.random.rand(128).astype(np.float32)
        x = gp.from_numpy(x_np)
        result = gp.softmax(x).to_numpy()
        # SciPy provides the reference; NumPy has no built-in softmax.
        expected = scipy.special.softmax(x_np)
        np.testing.assert_array_almost_equal(result, expected, decimal=5)
```
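
The per-operation classes above all follow the same pattern: build NumPy inputs, run the kernel, and compare with the NumPy result. A shared helper could remove that repetition. The sketch below is one possible shape, not an existing utility in this repo: `assert_matches_numpy` and `DEFAULT_TOLERANCES` are hypothetical names, the tolerance values are illustrative placeholders, and it uses `np.testing.assert_allclose` (explicit `rtol`/`atol`) in place of `assert_array_almost_equal`. A NumPy function stands in for the GPU kernel so the snippet runs without a device.

```python
import numpy as np

# Illustrative per-dtype tolerances (assumed values; tune them per kernel).
DEFAULT_TOLERANCES = {
    np.dtype(np.float32): {"rtol": 1e-5, "atol": 1e-6},
    np.dtype(np.float16): {"rtol": 1e-2, "atol": 1e-3},
}


def assert_matches_numpy(op, reference, *inputs, rtol=None, atol=None):
    """Run `op` and `reference` on the same NumPy inputs and compare results.

    `op` wraps the kernel under test and must return a NumPy array;
    `reference` is the NumPy (or SciPy) implementation to compare against.
    """
    expected = np.asarray(reference(*inputs))
    actual = np.asarray(op(*inputs))
    assert actual.shape == expected.shape, (actual.shape, expected.shape)

    # Fall back to a strict default for dtypes not listed above.
    tol = DEFAULT_TOLERANCES.get(expected.dtype, {"rtol": 1e-7, "atol": 0.0})
    np.testing.assert_allclose(
        actual,
        expected,
        rtol=tol["rtol"] if rtol is None else rtol,
        atol=tol["atol"] if atol is None else atol,
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed keeps failures reproducible
    a = rng.random(1024, dtype=np.float32)
    b = rng.random(1024, dtype=np.float32)

    # Stand-in for the GPU kernel. In a real test this would be something like:
    #   lambda a, b: gp.sub(gp.from_numpy(a), gp.from_numpy(b)).to_numpy()
    assert_matches_numpy(np.subtract, lambda a, b: a - b, a, b)
```

Passing the kernel as a callable keeps the helper independent of the GPU library, so the same function could be reused for `div`, `exp`, `log`, and the reductions with only the two callables changing.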

## Acceptance Criteria

- [ ] All elementwise operations (`sub`, `div`, `add_inplace`, `mul_inplace`, `copy_to`) have NumPy tests
- [ ] All unary operations (`exp`, `log`, `relu`) have NumPy tests
- [ ] All reduction operations (`sum`, `mean`, `max`, `softmax`) have NumPy tests
- [ ] NN operations (`gelu`, `silu`, `layernorm`, `rmsnorm`) have NumPy/SciPy reference tests
- [ ] Tensor operations (`concat_axis0`, `repeat_interleave_axis1`, `transpose_3d_021`, `cast_*`) have NumPy tests
- [ ] Tests use appropriate tolerances for floating-point comparison
- [ ] SM-dependent operations (FP8/NVF4) use `pytest.mark.skipif` for hardware availability
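
For the hardware-gating criterion, a small marker factory is one way to keep the `pytest.mark.skipif` conditions in a single place. This is a rough sketch under stated assumptions: `current_sm`, `sm_supported`, and `requires_sm` are hypothetical helpers, `current_sm` is a placeholder that must be replaced with the project's real device query, and reading the thresholds as FP8 needing SM 90 and NVF4 needing SM 120 is an interpretation of this issue's notes rather than a confirmed requirement.

```python
from typing import Optional

# Assumed thresholds: FP8 -> SM 90, NVF4 -> SM 120 (an interpretation of the
# "SM < 90/120" note in this issue, not a confirmed hardware requirement).
MIN_SM_FP8 = 90
MIN_SM_NVF4 = 120


def current_sm() -> Optional[int]:
    """Hypothetical placeholder for the device query.

    Should return the compute capability as an integer (e.g. 89, 90, 120),
    or None when no usable GPU is present. Replace the body with the
    project's actual device API.
    """
    return None


def sm_supported(sm: Optional[int], required: int) -> bool:
    """True only if a GPU was detected and it meets the required SM version."""
    return sm is not None and sm >= required


def requires_sm(required: int):
    """Build a skipif marker for tests that need at least `required` SM."""
    import pytest  # imported here so the pure helpers above work without pytest

    sm = current_sm()
    return pytest.mark.skipif(
        not sm_supported(sm, required),
        reason=f"requires SM >= {required}, detected {sm}",
    )
```

A test class would then be decorated with `@requires_sm(MIN_SM_FP8)` or `@requires_sm(MIN_SM_NVF4)`. With the placeholder returning `None`, every gated test is skipped, which is the safe default on machines without a GPU.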

## Notes

- Use `np.testing.assert_array_almost_equal` with appropriate `decimal` parameter
- For operations like `softmax`, use `scipy.special.softmax` as reference
- For `layernorm`/`rmsnorm`, implement NumPy reference manually
- FP8/NVF4 tests should skip on unsupported hardware (SM < 90 for FP8, SM < 120 for NVF4)
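
Since `layernorm` and `rmsnorm` have no NumPy or SciPy equivalent, the manual references mentioned above might look like the sketch below. The conventions here are assumptions, not the kernels' confirmed behavior: normalization over the last axis, population variance, `eps=1e-5`, and scale/shift parameters named `gamma`/`beta`. If the GPU kernels use different conventions, the references need to be adjusted to match. Intermediate math is done in float64 so the reference is more accurate than the kernel being checked.

```python
import numpy as np


def layernorm_ref(x, gamma, beta, eps=1e-5):
    """Reference layer norm over the last axis (assumed convention)."""
    x64 = x.astype(np.float64)
    mean = x64.mean(axis=-1, keepdims=True)
    var = x64.var(axis=-1, keepdims=True)  # population variance (ddof=0)
    normalized = (x64 - mean) / np.sqrt(var + eps)
    return (normalized * gamma + beta).astype(x.dtype)


def rmsnorm_ref(x, gamma, eps=1e-5):
    """Reference RMS norm over the last axis (assumed convention)."""
    x64 = x.astype(np.float64)
    rms = np.sqrt(np.mean(x64 * x64, axis=-1, keepdims=True) + eps)
    return (x64 / rms * gamma).astype(x.dtype)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 64), dtype=np.float32)
    gamma = np.ones(64, dtype=np.float32)
    beta = np.zeros(64, dtype=np.float32)

    y = layernorm_ref(x, gamma, beta)
    # With unit scale and zero shift, each row is roughly zero-mean, unit-variance.
    print(y.mean(axis=-1), y.std(axis=-1))

    z = rmsnorm_ref(x, gamma)
    # With unit scale, each row has a root-mean-square close to 1.
    print(np.sqrt(np.mean(z * z, axis=-1)))
```

A GPU test would compare the kernel output against these functions with a float32-appropriate tolerance, since the kernel accumulates in lower precision than the float64 reference.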
