[OPT] Faster the Two-Way Attention Block for CPU/GPU #181
Open
IchiruTake wants to merge 1 commit into
Pull request overview
This PR optimizes TwoWayAttentionBlock.forward in mobile_sam/modeling/transformer.py by avoiding redundant computation of keys + key_pe, aiming to reduce extra tensor allocations and improve inference latency on CPU/GPU without changing model behavior.
Changes:
- Hoists `k = keys + key_pe` to compute once and reuse across both cross-attention calls.
- Updates inline comments to reflect reuse of the precomputed `k`.
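The hoisting described above can be illustrated with a toy sketch. It uses plain Python lists instead of tensors so that it runs without PyTorch, and `add`, `toy_attn`, `forward_before`, and `forward_after` are hypothetical stand-ins rather than code from `transformer.py`. It only shows that computing `k` once is equivalent as long as `keys` is not modified between the two cross-attention calls.

```python
from typing import List, Tuple

Vec = List[float]


def add(a: Vec, b: Vec) -> Vec:
    """Element-wise sum; stands in for a tensor addition such as keys + key_pe."""
    return [x + y for x, y in zip(a, b)]


def toy_attn(q: Vec, k: Vec) -> Vec:
    """Hypothetical stand-in for an attention layer (element-wise product)."""
    return [x * y for x, y in zip(q, k)]


def forward_before(queries: Vec, keys: Vec, query_pe: Vec, key_pe: Vec) -> Tuple[Vec, Vec]:
    # Tokens attending to image embedding
    q = add(queries, query_pe)
    k = add(keys, key_pe)  # first computation
    queries = add(queries, toy_attn(q, k))
    # Image embedding attending to tokens
    q = add(queries, query_pe)
    k = add(keys, key_pe)  # recomputed although keys has not changed
    keys = add(keys, toy_attn(k, q))
    return queries, keys


def forward_after(queries: Vec, keys: Vec, query_pe: Vec, key_pe: Vec) -> Tuple[Vec, Vec]:
    k = add(keys, key_pe)  # computed once, reused below
    # Tokens attending to image embedding
    q = add(queries, query_pe)
    queries = add(queries, toy_attn(q, k))
    # Image embedding attending to tokens (keys is only updated after this call)
    q = add(queries, query_pe)
    keys = add(keys, toy_attn(k, q))
    return queries, keys
```

Both variants perform the same arithmetic in the same order, so their outputs should match exactly; the second simply skips one addition and one temporary allocation.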
Comment on lines +163 to +165
# Compute once and Reuse while `keys` is unchanged.
k = keys + key_pe

# Cross attention block, tokens attending to image embedding
q = queries + query_pe
k = keys + key_pe
# k = keys + key_pe # Re-use the `keys` as above.
# Cross attention block, image embedding attending to tokens
q = queries + query_pe
k = keys + key_pe
# k = keys + key_pe # Re-use the `keys` as above.
PR: Deduplicate `keys + key_pe` in `TwoWayAttentionBlock`
Summary:
Improve the `TwoWayAttentionBlock` by deduplicating the `keys + key_pe` computation.
Description:
In the `forward` method in the `transformer.py` file, `keys + key_pe` is computed twice: once for the `queries` path (2 & 3) and once for the `keys` path (4). This happens across four consecutive `forward` calls in MobileSAMv1 (tested on the bundled `mobile_sam.pt` file). For each image, the redundant computation generates an extra tensor of shape `(N_image, 4096, 256)`, which causes some extra performance overhead. The improvement is relatively modest, especially on the CPU.
Improvement:
Tested on Windows 11 with a Ryzen 7 7435HS CPU and an RTX 4060 Laptop GPU (no overclock), with one initial warm-up run. The test repeatedly calls the model over 100 times on a single image to measure latency (hopefully there is an improvement on Linux as well).
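A measurement loop of the kind described above could look roughly like the following. This is a minimal sketch, not the author's benchmark script: `measure_latency` is a hypothetical helper, and the defaults (one warm-up call, 100 timed calls) are assumptions taken from the description.

```python
import time
from statistics import mean
from typing import Callable, List


def measure_latency(fn: Callable[[], object], warmup: int = 1, iters: int = 100) -> List[float]:
    """Return per-call wall-clock latencies in seconds for `fn`.

    `fn` is assumed to run one full inference on a single image. For GPU runs,
    `fn` should block until the device has finished (for example by calling
    torch.cuda.synchronize() before returning), since CUDA kernels are
    launched asynchronously and would otherwise be under-measured.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Placeholder workload standing in for a model call.
    latencies = measure_latency(lambda: sum(range(10_000)))
    print(f"mean latency: {mean(latencies) * 1e3:.3f} ms over {len(latencies)} calls")
```

Running the same helper against the code before and after the change, on the same image and device, gives comparable numbers; reporting the median alongside the mean can reduce the effect of outliers.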
Changed Files:
`./mobile_sam/modeling/transformer.py`
Behaviour Changes: NO
Reproduction Code
TwoWayAttentionBlock.forward— before: