Skip to content

gh-151646: Fix data race between gc.get_stats() and GC in free-threading builds#151766

Merged
nascheme merged 2 commits into
python:mainfrom
zangjiucheng:gh-151646-fix
Jul 4, 2026
Merged

gh-151646: Fix data race between gc.get_stats() and GC in free-threading builds#151766
nascheme merged 2 commits into
python:mainfrom
zangjiucheng:gh-151646-fix

Conversation

@zangjiucheng

@zangjiucheng zangjiucheng commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

In free-threading builds, gc_get_stats_impl() reads per-generation stats
with plain struct copies while gc_collect_main() writes the same fields
after _Py_StartTheWorld() — no synchronisation on either side.

Fix: add PyMutex stats_mutex to _gc_runtime_state (under
Py_GIL_DISABLED) and hold it around both the seven field writes in the
writer and the three struct copies in the reader. The lock is acquired after
_Py_StartTheWorld(), so no deadlock with stop-the-world. GIL build is
unchanged.

Also adds Lib/test/test_free_threading/test_gc_get_stats_race.py.

Verified with TSAN on Linux: race confirmed on main, clean on the patch.

🤖 Generated with Claude Code

sees a consistent snapshot rather than racing on these fields
(gh-151646). get_stats() reads buffer->index inside the lock so that
both writer and reader resolve the same slot under the same mutex. */
PyMutex_Lock(&gcstate->stats_mutex);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the guard macro Py_GIL_DISABLED is missing here and below the unlock.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole gc_free_threading.c file is wrapped in #ifdef Py_GIL_DISABLED so I don't think that's needed.

@nascheme

nascheme commented Jul 3, 2026

Copy link
Copy Markdown
Member

This looks okay to me. Some comments:

  • The unit test is too heavy, IMHO. It seems 50 collections is enough to trigger the TSAN warning, at least for me. I think you should just loop on the writer (gc.collect()) and exit the readers when the writer is done.
  • I suggest reducing the comments a bit. I would have a comment for the stats_mutex struct member, explaining what it protects. Where you acquire and release it you don't need a comment (people can look at the member comment to see the purpose of that mutex and when it needs to be held). The "so they can run concurrently under free-threading without a data race" part could be removed too. People generally know what a mutex is for.

Suggested unit test change:

diff --git a/Lib/test/test_free_threading/test_gc_get_stats_race.py b/Lib/test/test_free_threading/test_gc_get_stats_race.py
index a91bdce0304..dc609a98fdf 100644
--- a/Lib/test/test_free_threading/test_gc_get_stats_race.py
+++ b/Lib/test/test_free_threading/test_gc_get_stats_race.py
@@ -43,49 +43,41 @@
 
 
 # One thread hammers gc.collect() (the writer side); enough reader threads run
-# gc.get_stats() concurrently to make the race easy to observe.  The iteration
-# counts are deliberately fixed and finite so the script always terminates.
-NUM_COLLECTORS = 1
-NUM_READERS = 8
-ITERATIONS = 50_000
+# gc.get_stats() concurrently to make the race easy to observe.  Readers run
+# until the writer is done so the test duration is controlled by the collection
+# count rather than by reader loop counts.
+NUM_READERS = 4
+ITERATIONS = 50
 
 
-def _stress_get_stats_race(num_collectors=NUM_COLLECTORS,
-                           num_readers=NUM_READERS,
-                           iterations=ITERATIONS):
-    """Race gc.collect() against gc.get_stats() and return collected stats."""
+def _stress_get_stats_race(num_readers=NUM_READERS, iterations=ITERATIONS):
+    """Race gc.collect() against gc.get_stats()."""
 
-    # Synchronise the start so collectors and readers overlap for as long as
+    # Synchronise the start so the writer and readers overlap for as long as
     # possible, maximising the chance of the read and write landing on the
     # statistics struct at the same time.
-    barrier = threading.Barrier(num_collectors + num_readers)
+    done = threading.Event()
 
     def collector():
-        barrier.wait()
-        for _ in range(iterations):
-            # Writer: each full collection updates gcstate->generation_stats.
-            gc.collect()
+        try:
+            for _ in range(iterations):
+                # Writer: each full collection updates gcstate->generation_stats.
+                gc.collect()
+        finally:
+            done.set()
 
     def reader():
-        barrier.wait()
-        for _ in range(iterations):
+        while not done.is_set():
             # Reader: copies the per-generation stats structs with no lock.
             gc.get_stats()
 
-    threads = [threading.Thread(target=collector) for _ in range(num_collectors)]
-    threads += [threading.Thread(target=reader) for _ in range(num_readers)]
-
-    with threading_helper.start_threads(threads):
-        pass
+    threading_helper.run_concurrently([collector] + [reader] * num_readers)
 
 
 @threading_helper.requires_working_threading()
 class TestGCGetStatsRace(unittest.TestCase):
     def test_get_stats_collect_race(self):
-        # Use a reduced iteration count under the regular test suite so the test
-        # stays reasonably quick while still exercising the race; the standalone
-        # __main__ path below uses the full ITERATIONS for a reliable repro.
-        _stress_get_stats_race(iterations=2_000)
+        _stress_get_stats_race()
 
         # The race is benign at the Python level: gc.get_stats() must still
         # return well-formed data and the interpreter must not crash.
@@ -98,10 +90,10 @@ def test_get_stats_collect_race(self):
 
 
 if __name__ == "__main__":
-    # Standalone reproduction: run the full-size race and exit cleanly so the
-    # script can be reused as a regression check once the fix lands.
-    print(f"Racing {NUM_COLLECTORS} gc.collect() thread(s) against "
-          f"{NUM_READERS} gc.get_stats() thread(s), {ITERATIONS} iterations each...")
+    # Standalone reproduction: run the race and exit cleanly so the script can
+    # be reused as a regression check once the fix lands.
+    print(f"Racing 1 gc.collect() thread against "
+          f"{NUM_READERS} gc.get_stats() thread(s), {ITERATIONS} collections...")
     _stress_get_stats_race()
     print("Done (no Python-level crash). "
           "Run under a free-threading + TSAN build to observe the data race.")

@zangjiucheng

Copy link
Copy Markdown
Contributor Author

I will take a look after work and address comments, thanks for reviewing :)

@nascheme

nascheme commented Jul 3, 2026

Copy link
Copy Markdown
Member

I revised the unit test to use threading_helper.run_concurrently(), simplifies it a bit.

Address review feedback (nascheme) on pythonGH-151766:

- The stress test looped 8 reader threads x 50,000 iterations each. Run one
  writer for a fixed number of gc.collect() calls and have the readers loop
  until the writer signals done, so the duration is driven by the collection
  count. Use threading_helper.run_concurrently() for start synchronisation.
- Trim the stats_mutex comments: keep one comment on the struct member saying
  what it protects, and drop the now-redundant comments at the lock sites.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@zangjiucheng

Copy link
Copy Markdown
Contributor Author

Thanks, applied both suggestions.

Switched to one writer looping over a fixed number of gc.collect() calls with the readers running until it signals done (via threading_helper.run_concurrently), and dropped the counts to NUM_READERS = 4 / ITERATIONS = 50. On my machine (free-threading debug build) the in-suite test went from ~3.3 s to ~0.08 s, and the old standalone 50k-iteration repro dropped from ~80 s to ~0.08 s. It still races concurrent gc.collect() against gc.get_stats() in the same window; I'm relying on your observation that ~50 collections is enough to surface the TSan warning (I don't have a TSan build handy locally, but it passes cleanly and quickly on a regular free-threading build).

@nascheme nascheme merged commit 8b1dbb1 into python:main Jul 4, 2026
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants