Make row/colscale, select_bitmap more memory-friendly for CUDA#406
Make row/colscale, select_bitmap more memory-friendly for CUDA#406VidithM wants to merge 6 commits intoDrTimothyAldenDavis:dev2from
Conversation
VidithM
commented
Mar 7, 2025
- Defer memcpys from the input to output matrix in row/colscale, select_bitmap until we know whether we are using CUDA. If so, do the memcpy inside the kernel.
|
Oops... my last comment (just deleted) was a reference to LAGraph, not GraphBLAS ... I got mixed up on which package I was looking at. |
| int64_t *tmp_Ah = A->h ; | ||
| A->p = NULL ; | ||
| A->h = NULL ; | ||
| GB_OK (GB_dup_worker (&C, C_iso, A, false, ztype)) ; |
There was a problem hiding this comment.
I hadn't considered this option for using GB_dup_worker but I see from my code that it can work. However, it might be cleaner to revise GB_dup_worker, and pass in flags that disable the copy of specific parts of the matrix, like "bool Ap_dup" or something, for each component. I can pull in this PR for now, but a "// FIXME: revise this to use flags for each component" can be added here.
There was a problem hiding this comment.
That makes sense. It looks like this wouldn't be a huge change, I can make a follow-up PR for this.
|
Looks great overall, just some minor comments and tweaks above. |
|
GB_cuda_ek_slice needs some work. It assumes the input matrix is always the "A" matrix, and it has a specific type for the Ap array it's using: GB_Ap_TYPE. It's not GB_Aj_TYPE so that will break if Ap and Aj have different types. That's one reason why I'm still using plain int64_t for scalars, like the return value of GB_cuda_ek_slice_entry. We probably should use C++ templates here so I can call GB_cuda_ek_slice for other matrices, like "B". |
|
I see CUDA/template/GB_cuda_jit_AxB_dot3_dense_phase1.cuh is broken. I'm calling the ek_slice methods on the M matrix but internally I use the "A" matrix integers. |
|
This should be good now. I'll make a follow-up PR to add flags to |
Source/mxm/GB_colscale.c
Outdated
|
|
||
| if (A->p != NULL) | ||
| { | ||
| GB_memcpy (C->p, A->p, (anvec+1) * psize, nthreads_max) ; |
There was a problem hiding this comment.
We need to assert that C->p and A->p have the same integer type. Likewise C->h and A->h.
There was a problem hiding this comment.
Fixed - this is already known to be true from the results of GB_dup_worker, and we use the same assumption elsewhere, e.g. in the kernels. Should the same assert be there too?
|
Addressed feedback and synced with the latest dev2 changes |