[OPT] Faster the Two-Way Attention Block for CPU/GPU by IchiruTake · Pull Request #181 · ChaoningZhang/MobileSAM

IchiruTake · 2026-05-05T08:34:55Z

PR: Deduplicate `keys + key_pe` in `TwoWayAttentionBlock`

Summary:

Improve `TwoWayAttentionBlock` by deduplicating the `keys + key_pe` calculation.

Description:

In the `forward` method in transformer.py, `keys + key_pe` is calculated twice, once for the queries path (2 & 3) and once for the keys path (4), across four consecutive forward calls in MobileSAMv1 (tested on the bundled mobile_sam.pt file). The improvement is relatively modest, especially on the CPU.

For each image, the duplicate calculation generates an extra tensor of shape (N_image, 4096, 256), which causes some performance overhead.
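
To put the size of that extra tensor in perspective, the arithmetic below estimates its memory footprint. It assumes float32 activations (4 bytes per element), which the PR does not state, and `extra_tensor_bytes` is a hypothetical helper written only for this illustration.

```python
def extra_tensor_bytes(n_images: int, tokens: int = 4096, channels: int = 256,
                       bytes_per_element: int = 4) -> int:
    """Size in bytes of one (n_images, tokens, channels) tensor.

    bytes_per_element=4 assumes float32 (an assumption, not stated in the PR);
    use 2 for float16/bfloat16.
    """
    return n_images * tokens * channels * bytes_per_element


if __name__ == "__main__":
    size = extra_tensor_bytes(n_images=1)
    print(f"{size} bytes = {size / 2**20:.1f} MiB per redundant keys + key_pe")
```

Under that assumption, each redundant addition allocates about 4 MiB per image, so the saving likely comes from avoided allocation and memory traffic more than from the arithmetic itself.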

Improvement:

Tested on Windows 11 with a Ryzen 7 7435HS CPU and an RTX 4060 Laptop GPU (no overclock), with one initial warm-up call. The test calls the predictor 100 times on a single image to measure latency (hopefully there is an improvement on Linux as well).

Environment	Before (seconds)	After (seconds)	Improvement
CPU	4.8498	4.3273	11%
GPU	0.9705	0.8295	15%
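
The percentages in the table follow from the relative reduction in wall-clock time. The short calculation below reproduces them and, assuming each figure is the total for the 100-call loop in the reproduction script, also derives an average per-call latency. `relative_improvement` and `per_call_ms` are illustrative names, not part of MobileSAM.

```python
def relative_improvement(before: float, after: float) -> float:
    """Percentage reduction in time going from `before` to `after`."""
    return (before - after) / before * 100.0


def per_call_ms(total_seconds: float, calls: int = 100) -> float:
    """Average latency per call in milliseconds.

    Assumes `total_seconds` covers exactly `calls` calls.
    """
    return total_seconds / calls * 1000.0


# Timings copied from the table (seconds).
timings = {"CPU": (4.8498, 4.3273), "GPU": (0.9705, 0.8295)}

for env, (before, after) in timings.items():
    print(
        f"{env}: {relative_improvement(before, after):.1f}% faster, "
        f"{per_call_ms(before):.2f} ms -> {per_call_ms(after):.2f} ms per call"
    )
```

Rounded to whole percentages, the results match the 11% and 15% reported in the table.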

Changed Files:

./mobile_sam/modeling/transformer.py

Behaviour Changes: NO

Reproduction Code

Benchmark script (run once before and once after the change to `TwoWayAttentionBlock.forward`):

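from time import perf_counter

import cv2
import numpy as np
import torch

from mobile_sam import sam_model_registry, SamPredictor

model_type = "vit_t"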
sam_checkpoint = "./weight/mobile_sam.pt"

mobile_sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
mobile_sam.to(device="cuda" if torch.cuda.is_available() else "cpu")
# mobile_sam.to(device="cpu")
mobile_sam.eval()

predictor = SamPredictor(mobile_sam)
image = cv2.imread('./test_images/3.jpeg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

predictor.set_image(image)

input_point = np.array([[800, 450]])
input_label = np.array([1])

# Warm-up
_, _, _ = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=False,
)

t = perf_counter()
for i in range(100):
    _, _, _ = predictor.predict(
        point_coords=input_point,
        point_labels=input_label,
        multimask_output=False,
    )
print('Execution time: ', perf_counter() - t)

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=False,
)
print('Masks Shape:', masks.shape)
print('Scores Shape:', scores.shape)
print('Logits Shape:', logits.shape)
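
The warm-up-then-loop pattern in the script above can be factored into a small reusable helper. This is a sketch rather than part of the PR: `benchmark` and its parameters are hypothetical names, and it uses only the standard library so that it can wrap any callable, for example a lambda around `predictor.predict`.

```python
from statistics import mean, median
from time import perf_counter
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 1,
              repeats: int = 100) -> Dict[str, float]:
    """Time `fn` over `repeats` calls after `warmup` untimed calls.

    Returns total, mean, and median wall-clock time in seconds.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as lazy initialisation

    samples = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        samples.append(perf_counter() - start)

    return {"total": sum(samples), "mean": mean(samples), "median": median(samples)}


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the model under test.
    stats = benchmark(lambda: sum(range(10_000)), warmup=1, repeats=100)
    print(f"total={stats['total']:.4f}s  median={stats['median'] * 1e3:.3f}ms")
```

Reporting the median alongside the total makes a single slow iteration less likely to skew the comparison. On GPU, the timed callable should return only after the device work has finished (for example by copying results to host memory), otherwise asynchronous kernel launches can make the timings look shorter than they are.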

Copilot

Pull request overview

This PR optimizes TwoWayAttentionBlock.forward in mobile_sam/modeling/transformer.py by avoiding redundant computation of keys + key_pe, aiming to reduce extra tensor allocations and improve inference latency on CPU/GPU without changing model behavior.

Changes:

Hoists k = keys + key_pe to compute once and reuse across both cross-attention calls.
Updates inline comments to reflect reuse of the precomputed k.


+        # Compute once and Reuse while `keys` is unchanged.
+        k = keys + key_pe
+


        # Cross attention block, tokens attending to image embedding
        q = queries + query_pe
-        k = keys + key_pe
+        # k = keys + key_pe   # Re-use the `keys` as above. 


        # Cross attention block, image embedding attending to tokens
        q = queries + query_pe
-        k = keys + key_pe
+        # k = keys + key_pe     # Re-use the `keys` as above. 
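
The hoisting in the diff relies on `keys` not being reassigned between the two cross-attention steps, so both additions see identical inputs. The toy below illustrates the idea without PyTorch: plain Python lists stand in for tensors, `toy_attention` is a made-up placeholder rather than real attention, and the residual connections, layer norms, self-attention, and MLP of the real block are omitted. All names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


class CountingAdd:
    """Element-wise list addition that records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, a: Vector, b: Vector) -> Vector:
        self.calls += 1
        return [x + y for x, y in zip(a, b)]


def toy_attention(q: Vector, k: Vector, v: Vector) -> Vector:
    """Placeholder mixing function; NOT real attention, just deterministic."""
    return [qi * ki + vi for qi, ki, vi in zip(q, k, v)]


def forward_before(queries, keys, query_pe, key_pe, add) -> Tuple[Vector, Vector]:
    # Tokens attending to image embedding
    q = add(queries, query_pe)
    k = add(keys, key_pe)
    queries = toy_attention(q, k, keys)
    # Image embedding attending to tokens: `keys` has not changed yet
    q = add(queries, query_pe)
    k = add(keys, key_pe)  # same inputs as above, recomputed
    keys = toy_attention(k, q, queries)
    return queries, keys


def forward_after(queries, keys, query_pe, key_pe, add) -> Tuple[Vector, Vector]:
    k = add(keys, key_pe)  # computed once, valid while `keys` is unchanged
    q = add(queries, query_pe)
    queries = toy_attention(q, k, keys)
    q = add(queries, query_pe)  # `queries` changed, so `q` must be rebuilt
    keys = toy_attention(k, q, queries)
    return queries, keys


if __name__ == "__main__":
    args = ([0.5, 1.0], [2.0, 3.0], [0.1, 0.2], [0.3, 0.4])
    add_before, add_after = CountingAdd(), CountingAdd()
    assert forward_before(*args, add_before) == forward_after(*args, add_after)
    print(add_before.calls, "additions before,", add_after.calls, "after")
```

Both variants return identical results, while the counter shows one fewer addition in the hoisted version. Note that `q` cannot be hoisted the same way, because `queries` is updated between the two steps.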


Update transformer.py

6a0024c

Copilot AI review requested due to automatic review settings May 5, 2026 08:34

Copilot started reviewing on behalf of IchiruTake May 5, 2026 08:35

Copilot AI reviewed May 5, 2026

IchiruTake wants to merge 1 commit into ChaoningZhang:master from IchiruTake:opt/opt-twoway-attention-block
