Description
Version
v18.13.0
Platform
Linux devloop-Surface-Laptop-4 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Subsystem
No response
What steps will reproduce the bug?
This was originally encountered when trying to capture a trace of the TypeScript compiler running; however, I can also reliably reproduce the same crash with a much simpler invocation that doesn't require setting up/installing a complex project.
$ node --trace-events-enabled /usr/lib/node_modules/npm/bin/npm-cli.js set init-license MIT
Aborted (core dumped)
How often does it reproduce? Is there a required condition?
See "additional information" below for a possible condition. In my experience, the commands that cause a crash - like the one above - reproduce it 100% of the time. That said, not all invocations with tracing enabled reproduce the issue.
What is the expected behavior?
The node process does not crash and the trace file is written to disk in its entirety
What do you see instead?
Aborted (core dumped)
On top of that, the trace file contains incomplete JSON
Additional information
Running under the debugger, I see a crash (abort) with this stack:
Thread 2 (Thread 0x7ffff7a42640 (LWP 2148334) "node"):
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737348118080) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737348118080) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737348118080, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff7a88476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff7a6e7f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x0000000000a9d38a in uv_mutex_lock (mutex=mutex@entry=0x5278de0 <mutex>) at ../deps/uv/src/unix/thread.c:352
#6 0x000000000164dfe8 in post (kind=UV__WORK_FAST_IO, q=0x5328190) at ../deps/uv/src/threadpool.c:143
#7 uv__work_submit (loop=loop@entry=0x5415bd8, w=w@entry=0x5328178, kind=kind@entry=UV__WORK_FAST_IO, work=work@entry=0x1656270 <uv__fs_work>, done=done@entry=0x1655b60 <uv__fs_done>) at ../deps/uv/src/threadpool.c:270
#8 0x000000000165953d in uv_fs_write (loop=0x5415bd8, req=0x5328028, file=<optimized out>, bufs=0x7ffff7a3e720, nbufs=<optimized out>, off=-1, cb=0xc685f0 <node::tracing::NodeTraceWriter::StartWrite(uv_buf_t)::{lambda(uv_fs_s*)#1}::_FUN(uv_fs_s*)>) at ../deps/uv/src/unix/fs.c:2191
#9 0x0000000000c687ce in node::tracing::NodeTraceWriter::WriteToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, int) [clone .part.70] ()
#10 0x0000000000c68e2a in node::tracing::NodeTraceWriter::FlushPrivate() ()
#11 0x0000000001652906 in uv__async_io (loop=0x5415bd8, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:163
#12 0x0000000001664e44 in uv__io_poll (loop=loop@entry=0x5415bd8, timeout=<optimized out>) at ../deps/uv/src/unix/epoll.c:374
#13 0x000000000165326e in uv_run (loop=0x5415bd8, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:406
#14 0x00007ffff7adab43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007ffff7b6ca00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
The code is calling abort because acquiring a mutex fails:
void uv_mutex_lock(uv_mutex_t* mutex) {
  if (pthread_mutex_lock(mutex))
    abort();
}
If I look at other threads, the main thread is executing here:
Thread 1 (Thread 0x7ffff7a437c0 (LWP 2148331) "node"):
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5327fe8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x5327fe8) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5327fe8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007ffff7ad9ac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5327f98, cond=0x5327fc0) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x5327fc0, mutex=0x5327f98) at ./nptl/pthread_cond_wait.c:627
#5 0x0000000001661879 in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:800
#6 0x0000000000c67993 in node::tracing::NodeTraceWriter::Flush(bool) ()
#7 0x0000000000c6422d in node::tracing::Agent::Flush(bool) ()
#8 0x0000000000c66c2b in node::tracing::NodeTraceBuffer::Flush() ()
#9 0x0000000001ad3222 in v8::platform::tracing::TracingController::StopTracing() ()
#10 0x0000000000c6504a in node::tracing::Agent::Disconnect(int) ()
#11 0x0000000000ab39f3 in node::DefaultProcessExitHandler(node::Environment*, int) ()
#12 0x0000000000af5083 in node::Environment::Exit(int) ()
#13 0x0000000000db0230 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#14 0x0000000000db176f in v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) ()
#15 0x00000000016ef579 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit ()
#16 0x00000000016734d0 in Builtins_InterpreterEntryTrampoline ()
#17 0x00002ec5934c15b9 in ?? ()
#18 0x000006fbf30c60e9 in ?? ()
#19 0x0000000600000000 in ?? ()
#20 0x00002ec5934c1689 in ?? ()
#21 0x000006fbf30c27d9 in ?? ()
#22 0x0000000000000000 in ?? ()
My non-expert guess is that the JS code calls process.exit, and the tear-down of state leads to a race condition where the mutex is used after it has been destroyed. That said, trying to repro this with a simple example that just calls process.exit(0) does not produce the same crash, so I'm guessing there's at least some amount of unpredictability in the repro (or my assessment could be wrong).
I suppose it's also worth mentioning that I haven't observed this failure on Windows, but that's perhaps because EnterCriticalSection has a void return type, and because calling EnterCriticalSection after DeleteCriticalSection doesn't necessarily cause problems, since much of the necessary state "lives" inside the CRITICAL_SECTION object itself.