Skip to content

Fixing flaky integration test: TestCrashAfterRejectDoesNotLoseMessages#274

Merged
behinddwalls merged 1 commit into
mainfrom
fix-flaky-crash-reject-test
Jun 25, 2026
Merged

Fixing flaky integration test: TestCrashAfterRejectDoesNotLoseMessages#274
behinddwalls merged 1 commit into
mainfrom
fix-flaky-crash-reject-test

Conversation

@ubettigole

Copy link
Copy Markdown
Contributor

Summary

TestCrashAfterRejectDoesNotLoseMessages was flaky because it asserted consumer lag immediately after Ack(), but Ack() only marks the delivery state — watermark advancement (which updates offset_acked) is deferred to the next poll loop tick. The test raced against the poll loop: if the lag check ran before advanceWatermark, offset_acked was still stale and lag was non-zero.

Fix by adding OnSignal to worker-2's queue and calling waitForSignal after acks, ensuring the poll loop has run advanceWatermark before we check lag. This matches the pattern already used by TestWatermarkAdvancesContiguously.

Test Plan

Ran the test 50 times to confirm it's no longer flaky:

./tool/bazel test //test/integration/extension/messagequeue/mysql:mysql_test --test_filter='TestSQLQueueIntegration/TestCrashAfterRejectDoesNotLoseMessages' --runs_per_test=50 --test_output=summary --jobs=4
     INFO: Found 1 test target...
     Target //test/integration/extension/messagequeue/mysql:mysql_test up-to-date:
       bazel-bin/test/integration/extension/messagequeue/mysql/mysql_test_/mysql_test
     INFO: Elapsed time: 205.664s, Critical Path: 20.62s
     INFO: 51 processes: 9 action cache hit, 1 internal, 50 darwin-sandbox.
     INFO: Build completed successfully, 51 total actions
     //test/integration/extension/messagequeue/mysql:mysql_test               PASSED in 20.6s
       Stats over 50 runs: max = 20.6s, min = 10.0s, avg = 16.0s, dev = 2.0s

     Executed 1 out of 1 test: 1 test passes.

Issue

…test

TestCrashAfterRejectDoesNotLoseMessages was flaky because it asserted
consumer lag immediately after Ack(), but Ack() only marks the delivery
state — watermark advancement (which updates offset_acked) is deferred
to the next poll loop tick. The test raced against the poll loop: if
the lag check ran before advanceWatermark, offset_acked was still stale
and lag was non-zero.

Fix by adding OnSignal to worker-2's queue and calling waitForSignal
after acks, ensuring the poll loop has run advanceWatermark before we
check lag. This matches the pattern already used by
TestWatermarkAdvancesContiguously.

Verified with 50 consecutive passes (0 failures).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ubettigole ubettigole requested review from a team, behinddwalls and sbalabanov as code owners June 25, 2026 18:17
@behinddwalls behinddwalls added this pull request to the merge queue Jun 25, 2026
Merged via the queue into main with commit b573988 Jun 25, 2026
15 checks passed
@behinddwalls behinddwalls deployed to stack-rebase June 25, 2026 18:54 — with GitHub Actions Active
@behinddwalls behinddwalls deleted the fix-flaky-crash-reject-test branch June 25, 2026 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants