Skip to content

test: add receive two-session lock harness#175

Open
alceops wants to merge 2 commits intoNikolayS:mainfrom
alceops:alce/two-session-receive-lock-harness
Open

test: add receive two-session lock harness#175
alceops wants to merge 2 commits intoNikolayS:mainfrom
alceops:alce/two-session-receive-lock-harness

Conversation

@alceops
Copy link
Copy Markdown

@alceops alceops commented May 2, 2026

Adds an optional two-session psql harness for the same-consumer pgque.receive() race discussed in #125.

The existing SQL regression coverage is single-session, so this script opens two real backend sessions:

  • session 1 receives one message and holds the transaction open
  • session 2 concurrently calls pgque.receive() for the same (queue, consumer)
  • the harness expects session 2 to wait on the row lock and return zero duplicate rows

I left it as an explicit script instead of wiring it into CI because it is meant as a red/green validator for the held #125 fix: current pre-fix code fails, while the row-lock fix should make it pass.

Verification run locally against current main install (expected red):

bash -n tests/two_session_receive_lock.sh
PGQUE_TEST_DSN=postgresql://postgres:***@localhost/pgque_test tests/two_session_receive_lock.sh
# FAIL: two-session receive harness failed (session1=0, session2=3)
# session2 waited ~3.2s, then received 1 duplicate row

Comment thread tests/two_session_receive_lock.sh Outdated
}

# Give session 1 enough time to enter receive() and hold its transaction open.
sleep 1
Copy link
Copy Markdown
Owner

@NikolayS NikolayS May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be racy on a slow runner; perhaps better to poll until session 1 is idle-in-tx or sub_batch is not null

smth like:

Suggested change
sleep 1
for i in $(seq 1 50); do
"${psql_base[@]}" -tAc "
select from pgque.subscription s
join pgque.queue q on q.queue_id = s.sub_queue
where q.queue_name = '${queue_name}' and s.sub_batch is not null
" | grep -q 1 && break
sleep 0.2
done

Comment thread tests/two_session_receive_lock.sh Outdated
v_count integer;
begin
select count(*) into v_count from s2_receive;
assert v_count = 0, format('session2 must not receive duplicate message, got %s', v_count);
Copy link
Copy Markdown
Owner

@NikolayS NikolayS May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this assert is going to be changed once #125 lands.

thinking:

the harness is meant to be RED on current main (which it is — your run shows session 2 returns 1 row after waiting ~3.2s) and GREEN once a fix lands

there is a concern related to the GREEN target.

Both the next_batch_custom() early-return at sql/pgque.sql:2127 (if batch_id is not null then return;) and your own description of the intended fix in the #125 thread — "assert session 2 waits and returns the same batch_id after session 1 commits" — point to v_count = 1, not 0. A FOR UPDATE OF s fix serializes the two next_batch_custom calls; the second one sees sub_batch = batch_A and returns it idempotently, then get_batch_events(batch_A) yields the same row.

If the assertion stays at v_count = 0, the harness will stay RED even after the intended fix lands and read as "fix is broken" to whoever runs it next.

how could resolve it:

Suggested change
assert v_count = 0, format('session2 must not receive duplicate message, got %s', v_count);
assert v_count = 1, format('session2 expected idempotent re-fetch (1 row), got %s', v_count);
-- and assert s2_receive.batch_id = s1_receive.batch_id

The "no double-delivery" guarantee here is same batch_id, same rows (single-worker contract), not zero rows in the second session

Comment thread tests/two_session_receive_lock.sh Outdated
fi

elapsed=$((end_epoch - start_epoch))
if (( elapsed < 2 )); then
Copy link
Copy Markdown
Owner

@NikolayS NikolayS May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking maybe < 3 is better; with pg_sleep(4) starting near T+0 and session 2 launching at T+1, the wait is ~3s. The < 2 threshold lets a 2.0s response (which would mean the lock didn't actually serialize) pass.

cleanup() {
rm -rf "${workdir}"
}
trap cleanup EXIT
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for "absolute cleanup", worth adding to cleanup:

select pgque.unregister_consumer('${queue_name}', 'c1');
select pgque.drop_queue('${queue_name}', true);

@@ -0,0 +1,108 @@
#!/usr/bin/env bash
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the Apache-2.0 / PgQ derivation header that the rest of tests/ carries (see e.g. tests/test_concurrent_receive.sql).

Comment thread tests/two_session_receive_lock.sh Outdated
select count(*) into v_count from s1_receive;
assert v_count = 1, format('session1 expected 1 message, got %s', v_count);
end \$\$;
select pg_sleep(4);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a very minor note:

Tightly coupled to the elapsed-time threshold on line 102. If you bump one, bump the other. Worth pulling both into a single HOLD_SECONDS=4 shell variable at the top so the contract is obvious.

@NikolayS
Copy link
Copy Markdown
Owner

NikolayS commented May 3, 2026

thanks @alceops!! left some comments above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants