Fix TSan warning in sub-interpreter test #5729

Merged
rwgk merged 1 commit into pybind:master on Jun 18, 2025

Conversation

b-pass
Contributor

@b-pass b-pass commented Jun 16, 2025

Description

I ran test_embed (including the sub-interpreter tests) with -fsanitize=thread. ThreadSanitizer complains about the internals singleton pointer being changed (null'd) from multiple different threads during sub-interpreter destruction.

I was hoping to find a cause for sporadic failures of the sub-interpreter test in ubuntu-latest, 3.12, -DPYBIND11_TEST_SMART_HOLDER=ON -DPYBIND11_SIMPLE_GIL_MANAGEMENT=ON.

I am not sure whether this is the issue; I was unable to reproduce the test failure locally.

The TSan output before this patch:

==================
WARNING: ThreadSanitizer: data race (pid=331420)
  Write of size 8 at 0x7fc9bb875e00 by thread T14:
    #0 pybind11::detail::internals_pp_manager<pybind11::detail::internals>::unref() /home/user/pybind11/include/pybind11/detail/internals.h:519 (external_module.cpython-312-x86_64-linux-gnu.so+0x563c0) (BuildId: 696a38b51e55c8831621b2d0f13d7773a1506a70)
    #1 PyInit_external_module /home/user/pybind11/tests/test_embed/external_module.cpp:9 (external_module.cpython-312-x86_64-linux-gnu.so+0x25592) (BuildId: 696a38b51e55c8831621b2d0f13d7773a1506a70)
    #2 _PyImport_LoadDynamicModuleWithSpec ../Python/importdl.c:169 (libpython3.12.so.1.0+0x2d9a9e) (BuildId: 5c546cb03f97d86afd10e4288ac3b79cdeba1951)
    #3 operator() /home/user/pybind11/tests/test_embed/test_subinterpreter.cpp:358 (test_embed+0x1d46be) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #4 __invoke_impl<void, C_A_T_C_H_T_E_S_T_6()::<lambda(int)>, int> /usr/include/c++/13/bits/invoke.h:61 (test_embed+0x1d7bab) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #5 __invoke<C_A_T_C_H_T_E_S_T_6()::<lambda(int)>, int> /usr/include/c++/13/bits/invoke.h:96 (test_embed+0x1d79b2) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #6 _M_invoke<0, 1> /usr/include/c++/13/bits/std_thread.h:292 (test_embed+0x1d780a) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #7 operator() /usr/include/c++/13/bits/std_thread.h:299 (test_embed+0x1d7702) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #8 _M_run /usr/include/c++/13/bits/std_thread.h:244 (test_embed+0x1d7614) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #9 <null> <null> (libstdc++.so.6+0xecdb3) (BuildId: ca77dae775ec87540acd7218fa990c40d1c94ab1)

  Previous write of size 8 at 0x7fc9bb875e00 by thread T13:
    #0 pybind11::detail::internals_pp_manager<pybind11::detail::internals>::unref() /home/user/pybind11/include/pybind11/detail/internals.h:519 (external_module.cpython-312-x86_64-linux-gnu.so+0x563c0) (BuildId: 696a38b51e55c8831621b2d0f13d7773a1506a70)
    #1 PyInit_external_module /home/user/pybind11/tests/test_embed/external_module.cpp:9 (external_module.cpython-312-x86_64-linux-gnu.so+0x25592) (BuildId: 696a38b51e55c8831621b2d0f13d7773a1506a70)
    #2 _PyImport_LoadDynamicModuleWithSpec ../Python/importdl.c:169 (libpython3.12.so.1.0+0x2d9a9e) (BuildId: 5c546cb03f97d86afd10e4288ac3b79cdeba1951)
    #3 operator() /home/user/pybind11/tests/test_embed/test_subinterpreter.cpp:358 (test_embed+0x1d46be) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #4 __invoke_impl<void, C_A_T_C_H_T_E_S_T_6()::<lambda(int)>, int> /usr/include/c++/13/bits/invoke.h:61 (test_embed+0x1d7bab) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #5 __invoke<C_A_T_C_H_T_E_S_T_6()::<lambda(int)>, int> /usr/include/c++/13/bits/invoke.h:96 (test_embed+0x1d79b2) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #6 _M_invoke<0, 1> /usr/include/c++/13/bits/std_thread.h:292 (test_embed+0x1d780a) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #7 operator() /usr/include/c++/13/bits/std_thread.h:299 (test_embed+0x1d7702) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #8 _M_run /usr/include/c++/13/bits/std_thread.h:244 (test_embed+0x1d7614) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #9 <null> <null> (libstdc++.so.6+0xecdb3) (BuildId: ca77dae775ec87540acd7218fa990c40d1c94ab1)

  Location is global 'pybind11::detail::get_internals_pp_manager()::internals_pp_manager' of size 40 at 0x7fc9bb875de0 (external_module.cpython-312-x86_64-linux-gnu.so+0xade00)

  Thread T14 (tid=331436, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1022 (libtsan.so.2+0x5ac1a) (BuildId: 38097064631f7912bd33117a9c83d08b42e15571)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xeceb0) (BuildId: ca77dae775ec87540acd7218fa990c40d1c94ab1)
    #2 C_A_T_C_H_T_E_S_T_6 /home/user/pybind11/tests/test_embed/test_subinterpreter.cpp:382 (test_embed+0x1d50db) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #3 Catch::TestInvokerAsFunction::invoke() const /home/user/pybind11/build12/tests/catch/catch.hpp:14330 (test_embed+0x3cdde) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #4 Catch::TestCase::invoke() const /home/user/pybind11/build12/tests/catch/catch.hpp:14169 (test_embed+0x3bb96) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #5 Catch::RunContext::invokeActiveTestCase() /home/user/pybind11/build12/tests/catch/catch.hpp:13025 (test_embed+0x3423b) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #6 Catch::RunContext::runCurrentTest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) /home/user/pybind11/build12/tests/catch/catch.hpp:12998 (test_embed+0x33e68) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #7 Catch::RunContext::runTest(Catch::TestCase const&) /home/user/pybind11/build12/tests/catch/catch.hpp:12759 (test_embed+0x31ee5) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #8 execute /home/user/pybind11/build12/tests/catch/catch.hpp:13352 (test_embed+0x363da) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #9 Catch::Session::runInternal() /home/user/pybind11/build12/tests/catch/catch.hpp:13562 (test_embed+0x37d59) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #10 Catch::Session::run() /home/user/pybind11/build12/tests/catch/catch.hpp:13518 (test_embed+0x378ee) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #11 int Catch::Session::run<char>(int, char const* const*) <null> (test_embed+0x9ecd1) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #12 main /home/user/pybind11/tests/test_embed/catch.cpp:40 (test_embed+0x55b35) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)

  Thread T13 (tid=331435, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1022 (libtsan.so.2+0x5ac1a) (BuildId: 38097064631f7912bd33117a9c83d08b42e15571)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xeceb0) (BuildId: ca77dae775ec87540acd7218fa990c40d1c94ab1)
    #2 C_A_T_C_H_T_E_S_T_6 /home/user/pybind11/tests/test_embed/test_subinterpreter.cpp:381 (test_embed+0x1d50a2) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #3 Catch::TestInvokerAsFunction::invoke() const /home/user/pybind11/build12/tests/catch/catch.hpp:14330 (test_embed+0x3cdde) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #4 Catch::TestCase::invoke() const /home/user/pybind11/build12/tests/catch/catch.hpp:14169 (test_embed+0x3bb96) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #5 Catch::RunContext::invokeActiveTestCase() /home/user/pybind11/build12/tests/catch/catch.hpp:13025 (test_embed+0x3423b) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #6 Catch::RunContext::runCurrentTest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) /home/user/pybind11/build12/tests/catch/catch.hpp:12998 (test_embed+0x33e68) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #7 Catch::RunContext::runTest(Catch::TestCase const&) /home/user/pybind11/build12/tests/catch/catch.hpp:12759 (test_embed+0x31ee5) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #8 execute /home/user/pybind11/build12/tests/catch/catch.hpp:13352 (test_embed+0x363da) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #9 Catch::Session::runInternal() /home/user/pybind11/build12/tests/catch/catch.hpp:13562 (test_embed+0x37d59) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #10 Catch::Session::run() /home/user/pybind11/build12/tests/catch/catch.hpp:13518 (test_embed+0x378ee) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #11 int Catch::Session::run<char>(int, char const* const*) <null> (test_embed+0x9ecd1) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)
    #12 main /home/user/pybind11/tests/test_embed/catch.cpp:40 (test_embed+0x55b35) (BuildId: 000fdc14d7f2b4671dae95570541b4d13f7c5059)

SUMMARY: ThreadSanitizer: data race /home/user/pybind11/include/pybind11/detail/internals.h:519 in pybind11::detail::internals_pp_manager<pybind11::detail::internals>::unref()
==================
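Stripped of the stack frames, the report shows two threads each performing an unsynchronized 8-byte store to the same global. A minimal sketch of that pattern follows. Every name in it (`SketchManager`, `plain_reset`, `atomic_reset`, `run_on_two_threads`) is hypothetical and not part of pybind11, and the atomic variant is included only for contrast; it is not what this PR does.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

namespace race_sketch {

struct Internals {};

// Hypothetical stand-in for the global manager named in the report.
struct SketchManager {
    Internals **plain_pp = nullptr;               // plain pointer: concurrent stores are a data race
    std::atomic<Internals **> atomic_pp{nullptr}; // atomic pointer: concurrent stores are well defined
};

// Mirrors the flagged store: fine from one thread, a race when two threads call it concurrently.
inline void plain_reset(SketchManager &m) { m.plain_pp = nullptr; }

// Contrast only: an atomic store, which TSan does not report as a race.
inline void atomic_reset(SketchManager &m) { m.atomic_pp.store(nullptr); }

// Runs fn(m) on two threads and waits for both to finish.
template <typename Fn>
void run_on_two_threads(SketchManager &m, Fn fn) {
    std::thread a([&] { fn(m); });
    std::thread b([&] { fn(m); });
    a.join();
    b.join();
}

} // namespace race_sketch
```

Passing `plain_reset` to `run_on_two_threads` in a build with `-fsanitize=thread` would be expected to produce a report of the same shape as the one above.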

        last_istate_.reset();
        internals_tls_p_.reset();
        return;
    }
#endif
    internals_singleton_pp_ = nullptr;
Collaborator

After looking around for a few minutes I'm thinking: This will only ever be reached from unsafe_reset_internals_for_single_interpreter() in tests/test_embed/test_subinterpreter.cpp. Is that correct (and intentional)?

Would it make sense to leave a small comment to explain?

Contributor Author

It's also reached when there's a single interpreter, from finalize_interpreter.

Collaborator

@b-pass Could you please take a look here?

It's a short conversation. It ends with this question:

Did you perhaps mean to write:

last_istate_.reset();
internals_tls_p_.reset();
if (get_num_interpreters_seen() == 1) {
    internals_singleton_pp_ = nullptr;
}

Contributor Author

get_num_interpreters_seen is not supposed to go down (it does in the tests, to keep that state from bleeding between tests). Since it never goes down in normal use, once it rises above 1 we stop using the singleton_pp and use the two thread locals instead.

So resetting the singleton_pp pointer should not be necessary once the count has increased. It was being reset before, for the sake of the tests, and that is what created this data race in the tests.

The structure of the code (#if, code, return, #endif) mirrors the function above it, and is that way to avoid having an #else with duplicate code. But it could be changed to maybe read a little more linearly if you want....
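The (#if, code, return, #endif) shape described here can be sketched in isolation as below. This is a simplified model rather than the pybind11 source: the namespace, the `SKETCH_HAS_SUBINTERPRETER_SUPPORT` macro, and the plain-pointer globals are all hypothetical stand-ins for the real members.

```cpp
#include <cassert>
#include <thread>

namespace guard_sketch {

struct Internals {};

// Hypothetical stand-in for PYBIND11_HAS_SUBINTERPRETER_SUPPORT.
#define SKETCH_HAS_SUBINTERPRETER_SUPPORT 1

int interpreters_seen = 1;                 // models get_num_interpreters_seen()
Internals **singleton_pp = nullptr;        // models internals_singleton_pp_ (shared by all threads)
thread_local Internals **tls_pp = nullptr; // models internals_tls_p_ (one copy per thread)

void unref() {
#if SKETCH_HAS_SUBINTERPRETER_SUPPORT
    if (interpreters_seen > 1) {
        tls_pp = nullptr; // only this thread's copy is touched
        return;           // the shared pointer is left alone
    }
#endif
    singleton_pp = nullptr; // reached only while a single interpreter has been seen
}

} // namespace guard_sketch
```

With the count above 1, concurrent calls to `unref()` in this model write only thread-local state, so no store to the shared pointer remains for two threads to collide on.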

Collaborator

Sorry I'm still unclear TBH.

Considering only this specific case:

  • PYBIND11_HAS_SUBINTERPRETER_SUPPORT is true
  • get_num_interpreters_seen() == 1

I believe that before this PR, this would run:

last_istate_.reset();
internals_tls_p_.reset();
internals_singleton_pp_ = nullptr;

With this PR, only this:

internals_singleton_pp_ = nullptr;

I.e. the two .reset() are skipped with this PR. Is that a correct understanding?

Could you please confirm that skipping the two .reset() is intentional?

Contributor Author

@b-pass b-pass Jun 18, 2025

Skipping the resets was intentional, as those are not used or touched until the count is greater than 1. But you're also correct: the real purpose of the PR was to avoid changing internals_singleton_pp_ when the count is above 1, so skipping the resets was not necessary.
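The three behaviors discussed in this thread can be laid side by side in a small model. It is only a sketch: `Actions` and the three functions are hypothetical names, and the pre-PR behavior for counts above 1 is an assumption, since the thread only spells out the count == 1 case.

```cpp
#include <cassert>

namespace variant_sketch {

// Which of the two cleanup actions a given variant performs.
struct Actions {
    bool reset_thread_locals; // last_istate_.reset() and internals_tls_p_.reset()
    bool null_singleton;      // internals_singleton_pp_ = nullptr
};

// Before the PR, as described for count == 1; assumed (not confirmed in the
// thread) to behave the same for larger counts.
inline Actions before_pr(int /*interpreters_seen*/) { return Actions{true, true}; }

// The merged change: the count decides which of the two actions runs.
inline Actions merged_pr(int interpreters_seen) {
    if (interpreters_seen > 1) {
        return Actions{true, false};
    }
    return Actions{false, true};
}

// The reviewer's alternative: always reset, null the singleton only at count 1.
inline Actions reviewer_suggestion(int interpreters_seen) {
    return Actions{true, interpreters_seen == 1};
}

} // namespace variant_sketch
```

In this model the merged change and the reviewer's alternative differ only at count 1, where the alternative also resets the thread locals; both leave the shared pointer untouched once the count exceeds 1.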

Collaborator

Got it, thanks!

@rwgk rwgk merged commit f2c0ab8 into pybind:master Jun 18, 2025
82 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jun 18, 2025
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Jun 18, 2025
3 participants