@muyuuuu (Contributor) commented Nov 25, 2024

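  1. Avoid memory leaks.
  2. The "one IO, multiple computations" part is my own understanding. I also used it earlier when optimizing OpenCL operators, so I consider it a fairly general approach.
  3. A segmentation fault occurs when the matrix size is not a multiple of 32. The approach I have seen is to first take the image region whose dimensions are divisible by 32 and accelerate it with CUDA, then handle the border in C. I am not sure how Baidu handles this, so I did not write up a solution.
  4. An explanation of TM. It took me quite a while to understand how TM is used, so I took the liberty of adding an explanation.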

@AndSonder (Collaborator) left a comment

LGTM

@AndSonder (Collaborator)

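Thanks for the additions to the documentation!

The problem that "a segmentation fault occurs when the matrix size is not a multiple of 32" actually comes up in many SGEMM optimization algorithms. The articles in the documentation are mainly meant for learning these optimization methods, and covering a lot of edge cases would make the code very complex.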

@AndSonder merged commit 2d37c86 into PaddleJitLab:develop on Nov 26, 2024
2 checks passed