[OPT] Faster the Two-Way Attention Block for CPU/GPU #181

Open
IchiruTake wants to merge 1 commit into ChaoningZhang:master from IchiruTake:opt/opt-twoway-attention-block

@IchiruTake

PR: Deduplicate keys + key_pe in TwoWayAttentionBlock

Summary:

Improve TwoWayAttentionBlock by deduplicating the keys + key_pe calculation.

Description:

In the forward method of TwoWayAttentionBlock in transformer.py, keys + key_pe is computed twice: once for the queries path (steps 2 & 3) and once for the keys path (step 4), over four consecutive forward calls in MobileSAMv1 (tested with the bundled mobile_sam.pt file). The improvement is relatively modest, especially on the CPU.

For each image, the redundant computation allocates an extra tensor of shape (N_image, 4096, 256), which adds some performance overhead.
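To put the size of that extra tensor in perspective, here is a small back-of-the-envelope calculation. It assumes float32 storage (4 bytes per element), which is not stated above; the helper name `extra_tensor_bytes` is made up for this illustration.

```python
def extra_tensor_bytes(n_images: int, tokens: int = 4096, channels: int = 256,
                       bytes_per_element: int = 4) -> int:
    """Size in bytes of one (n_images, tokens, channels) tensor.

    bytes_per_element=4 assumes float32; use 2 for float16/bfloat16.
    """
    return n_images * tokens * channels * bytes_per_element


size = extra_tensor_bytes(1)
print(f"{size} bytes = {size / 2**20:.1f} MiB per image")
```

Under that assumption, each redundant `keys + key_pe` allocates about 4 MiB per image, which also has to be written once and then freed.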

Improvement:

Tested on Windows 11 with a Ryzen 7 7435HS CPU and an RTX 4060 Laptop GPU (no overclock), after one initial warm-up call. The test calls predict 100 times on a single image to measure latency (hopefully the improvement shows up on Linux as well).

| Environment | Before (seconds) | After (seconds) | Improvement |
|---|---|---|---|
| CPU | 4.8498 | 4.3273 | 11% |
| GPU | 0.9705 | 0.8295 | 15% |
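The percentages can be reproduced from the raw timings. The snippet below is only a sanity check of the arithmetic; it treats each timing as the total for the 100-call loop in the reproduction script, and the function name `improvement_percent` is hypothetical.

```python
def improvement_percent(before_s: float, after_s: float) -> float:
    """Relative reduction in wall-clock time, in percent."""
    return (before_s - after_s) / before_s * 100.0


# (before, after) totals in seconds for 100 predict() calls, as reported above
results = {"CPU": (4.8498, 4.3273), "GPU": (0.9705, 0.8295)}

for env, (before, after) in results.items():
    pct = improvement_percent(before, after)
    per_call_ms = after / 100 * 1000  # average latency per call after the change
    print(f"{env}: {pct:.1f}% faster, ~{per_call_ms:.1f} ms per call after the change")
```

Rounded to whole percentages, this matches the 11% and 15% figures reported for CPU and GPU.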

Changed Files:

  • ./mobile_sam/modeling/transformer.py

Behaviour Changes: NO
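The "no behaviour change" claim can be illustrated on a toy version of the block. The sketch below is not the code in transformer.py: it drops the self-attention, MLP and normalization steps, replaces the attention modules with a plain single-head `attend` function, and assumes 5 prompt tokens. It only shows that hoisting `keys + key_pe` is safe because `keys` is not modified between the two cross-attention steps.

```python
import math

import torch


def attend(q, k, v):
    """Single-head scaled dot-product attention (stand-in for the real Attention modules)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v


def forward_before(queries, keys, query_pe, key_pe):
    # Tokens attending to image embedding
    q = queries + query_pe
    k = keys + key_pe  # first computation
    queries = queries + attend(q, k, keys)
    # Image embedding attending to tokens
    q = queries + query_pe
    k = keys + key_pe  # same value recomputed: `keys` has not changed yet
    keys = keys + attend(k, q, queries)
    return queries, keys


def forward_after(queries, keys, query_pe, key_pe):
    k = keys + key_pe  # computed once, reused by both cross-attention steps
    q = queries + query_pe
    queries = queries + attend(q, k, keys)
    q = queries + query_pe  # still recomputed, because `queries` was updated
    keys = keys + attend(k, q, queries)
    return queries, keys


torch.manual_seed(0)
queries, query_pe = torch.randn(1, 5, 256), torch.randn(1, 5, 256)  # 5 tokens: assumption
keys, key_pe = torch.randn(1, 4096, 256), torch.randn(1, 4096, 256)

with torch.no_grad():
    out_before = forward_before(queries, keys, query_pe, key_pe)
    out_after = forward_after(queries, keys, query_pe, key_pe)

same = all(torch.equal(a, b) for a, b in zip(out_before, out_after))
print("Outputs identical:", same)
```

Note that `q` cannot be hoisted the same way, since `queries` is updated between the two steps.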

Reproduction Code

TwoWayAttentionBlock.forward — before:

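from time import perf_counter

import cv2
import numpy as np
import torch
from mobile_sam import SamPredictor, sam_model_registry

model_type = "vit_t"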
sam_checkpoint = "./weight/mobile_sam.pt"

mobile_sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
mobile_sam.to(device="cuda" if torch.cuda.is_available() else "cpu")
# mobile_sam.to(device="cpu")
mobile_sam.eval()

predictor = SamPredictor(mobile_sam)
image = cv2.imread('./test_images/3.jpeg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

predictor.set_image(image)

input_point = np.array([[800, 450]])
input_label = np.array([1])

# Warm-up
_, _, _ = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=False,
)

t = perf_counter()
for i in range(100):
    _, _, _ = predictor.predict(
        point_coords=input_point,
        point_labels=input_label,
        multimask_output=False,
    )
print('Execution time: ', perf_counter() - t)

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=False,
)
print('Masks Shape:', masks.shape)
print('Scores Shape:', scores.shape)
print('Logits Shape:', logits.shape)

Copilot AI review requested due to automatic review settings May 5, 2026 08:34
Copilot AI left a comment

Pull request overview

This PR optimizes TwoWayAttentionBlock.forward in mobile_sam/modeling/transformer.py by avoiding redundant computation of keys + key_pe, aiming to reduce extra tensor allocations and improve inference latency on CPU/GPU without changing model behavior.

Changes:

  • Hoists k = keys + key_pe to compute once and reuse across both cross-attention calls.
  • Updates inline comments to reflect reuse of the precomputed k.

Comment on lines +163 to +165
+ # Compute once and Reuse while `keys` is unchanged.
+ k = keys + key_pe

  # Cross attention block, tokens attending to image embedding
  q = queries + query_pe
- k = keys + key_pe
+ # k = keys + key_pe # Re-use the `keys` as above.
  # Cross attention block, image embedding attending to tokens
  q = queries + query_pe
- k = keys + key_pe
+ # k = keys + key_pe # Re-use the `keys` as above.