fix(profiling): workaround for on-CPU Task race condition #15750
Performance SLOs
Comparing candidate kowalski/fix-profiling-workaround-for-on-cpu-task-race-condition-fix (3788cd1) with baseline kowalski/chore-profiling-detect-cycles-in-asyncio (e7dbf7f)
📈 Performance Regressions (3 suites)
📈 iastaspects - 118/118. All listed benchmarks are marked ✅ against their Time and Memory SLOs. Benchmarks flagged 📈 on Time vs baseline:
- add_aspect: 17.986µs (SLO: <20.000µs), vs baseline +21.1%
- encode_aspect: 18.178µs (SLO: <30.000µs), vs baseline +23.0%
- translate_aspect: 24.216µs (SLO: <30.000µs), vs baseline +17.8%
Memory vs baseline: +4.3% to +5.6% across the listed benchmarks.
📈 iastaspectsospath - 24/24. All listed benchmarks are marked ✅ against their Time and Memory SLOs. Benchmarks flagged 📈 on Time vs baseline:
- ospathbasename_aspect: 5.186µs (SLO: <10.000µs), vs baseline +21.7%
Memory vs baseline: +4.7% to +5.3% across the listed benchmarks.
📈 telemetryaddmetric - 30/30. All listed benchmarks are marked ✅ against their Time and Memory SLOs. Benchmarks flagged 📈 on Time vs baseline:
- 1-count-metric-1-times: 3.386µs (SLO: <20.000µs), vs baseline +14.6%
Memory vs baseline: +4.5% to +5.2% across the listed benchmarks.
🟡 Near SLO Breach (15 suites)
🟡 coreapiscenario - 10/10 (1 unstable)
Superseded by #15780.
Description
What is this about?
This PR updates the Task unwinding logic in the Profiler to (more) properly handle race conditions around running/"on CPU" Tasks. A Task can be either in a running state (i.e. actively computing something itself, like executing a regular Python function) or in a sleeping state (i.e. waiting for something else to happen to wake up).
Why do we need it?
Because we don't take a "snapshot of the whole Python process at once", there is a race condition in our Sampler.
We first capture the Thread Stack (i.e. for the current Thread, if it is running, what Python code the interpreter is running), then for each Task in the Thread's Event Loop [if it exists] we look at the Task's own Stack. (Since Task/Coroutines are pausable, they have their own Stack that is kept in memory when they're paused, then re-loaded into context when they're resumed. Walking each Task's Stack allows us to e.g. know what code they're "running", even when they aren't actually currently running code...)
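To make the idea of a per-Task Stack concrete, here is a minimal in-process sketch. It is not how the Profiler reads Stacks; it only relies on the public `Task.get_coro()` method and the standard coroutine attributes `cr_frame`, `cr_await` and `cr_running` to show that a paused Task still carries its frames. The names `task_stack_names`, `inner`, `outer` and `demo` are made up for illustration.

```python
import asyncio


def task_stack_names(task):
    """Walk a Task's coroutine chain, outermost first, by following cr_await."""
    names = []
    coro = task.get_coro()
    # Stop at the first awaited object that is not a coroutine (e.g. a Future).
    while coro is not None and hasattr(coro, "cr_frame"):
        if coro.cr_frame is not None:
            names.append(coro.cr_frame.f_code.co_name)
        coro = coro.cr_await
    return names


async def inner(event):
    await event.wait()  # the Task parks here until the event is set


async def outer(event):
    await inner(event)


async def demo():
    event = asyncio.Event()
    task = asyncio.create_task(outer(event))
    await asyncio.sleep(0)  # give the Task one turn so it reaches its await
    names = task_stack_names(task)           # frames of a Task that is NOT running
    running = task.get_coro().cr_running     # False: the Task is sleeping
    event.set()
    await task
    return names, running


names, running = asyncio.run(demo())
print(names, running)
```

The Task is paused when it is inspected, yet its chain of frames is still there to walk. That is why a Sampler can attribute "what the Task is doing" even when the Task is not on CPU.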
Going back to the race condition question, we may have a discrepancy between what the Python Thread Stack tells us (what the interpreter is running) and what Task objects themselves tell us (because a tiny amount of time actually elapses between the moment we capture the Thread Stack and the moment we inspect the Task objects, so what is happening may have changed in the meantime).
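The discrepancy can be pictured with a toy model. The sketch below is purely illustrative: `State`, `TIMELINE`, `two_step_sample` and `is_consistent` are hypothetical names, and the timeline is invented. It only shows why reading the Thread Stack and the Task objects at two different instants can yield a pair of observations that never coexisted.

```python
from typing import FrozenSet, NamedTuple, Optional


class State(NamedTuple):
    on_cpu: Optional[str]    # Task the Thread Stack shows as executing, if any
    running: FrozenSet[str]  # Tasks that report themselves as running


# Invented timeline: Task A runs, then nothing runs, then Task B runs.
TIMELINE = [
    State("A", frozenset({"A"})),
    State(None, frozenset()),
    State("B", frozenset({"B"})),
]


def two_step_sample(t_thread, t_tasks):
    """Read the Thread Stack at one instant and the Task objects at another."""
    return TIMELINE[t_thread].on_cpu, TIMELINE[t_tasks].running


def is_consistent(on_cpu, running):
    """Both views agree when exactly the Task seen on CPU is flagged as running."""
    expected = frozenset() if on_cpu is None else frozenset({on_cpu})
    return running == expected


atomic = two_step_sample(0, 0)  # both reads happen at the same instant
racy = two_step_sample(0, 2)    # Task objects are read after the state changed

print(atomic, is_consistent(*atomic))
print(racy, is_consistent(*racy))
```

In the racy case the Thread Stack still shows Task A's frames while only Task B claims to be running, which is the kind of mismatch discussed here.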
I have previously described in more detail the buggy/unexpected behaviour that this race condition can cause; this PR improves how we handle it.
Note that there is a pretty obvious tradeoff here. When we detect a discrepancy, we can:
For the time being, things can only get better because we're in a state where we don't deal with the problem at all. The current PR biases towards the last possibility: we skip Samples that we know will be bogus. If this happens sufficiently rarely [a claim I still need numbers to back] then this is OK.
How does it work?
The main issue we address here is the race condition where what the Python Thread Stack tells us is different from what unwinding the Stack for each Task tells us. If the Python Thread Stack appears to be running a Task, then we need to make sure that we do see a Task running.
To detect that race condition, we walk the Python Stack (once per Thread) to detect whether we see Handle.run Frames. Those indicate that the Event Loop is currently stepping the Coroutine, in other words executing code.
When that happens, we expect at least one Task to be marked as running (there could be more; that is also a race condition, but it's OK as long as CPU Time is not concerned).
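The check described above can be sketched in-process as follows. This is an illustration under stated assumptions, not the Profiler's implementation: the Profiler does not use `sys._getframe()`. The helpers `stack_is_stepping_task`, `should_keep_sample` and `probe` are hypothetical. In CPython's default Event Loop the method that runs a scheduled callback is `Handle._run` in `asyncio/events.py`, which is what the sketch matches. Keeping every Sample outside the "stepping but no running Task" case is an assumption; the text only describes that one case.

```python
import asyncio
import os
import sys


def stack_is_stepping_task(frame):
    """Return True if the stack ending at `frame` contains asyncio's Handle._run."""
    while frame is not None:
        code = frame.f_code
        if code.co_name == "_run" and os.path.basename(code.co_filename) == "events.py":
            return True
        frame = frame.f_back
    return False


def should_keep_sample(thread_is_stepping, running_task_count):
    """Drop the Sample when the Thread Stack shows a Task on CPU but no Task is
    marked as running. Every other combination is kept (an assumption)."""
    if thread_is_stepping and running_task_count == 0:
        return False
    return True


async def probe():
    # Called from inside a Task, so the Event Loop is stepping this coroutine.
    return stack_is_stepping_task(sys._getframe())


inside_loop = asyncio.run(probe())
print(inside_loop)
print(should_keep_sample(True, 0), should_keep_sample(True, 1))
```

With a different Event Loop implementation the marker frame may not exist or may be named differently, so the file and function names matched here should be treated as specific to the default CPython loop.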
The main problem we are trying to avoid here is having some of Task A's Frames appearing as part of Task B's Stack. Working around this requires properly splitting the Python Stack when it says it is running a Task, such that we only push the asyncio runtime Frames on top of each non-Task A Task. Walking the Python Stack allows us to do that properly.
What does this cost us?
This is not completely free: we're doing more work (namely, walking the stack at each Sample). Looking at Full Host Profiles on a high-CPU asyncio-based Python script, I'm getting the following difference.
Note that the total Profiler overhead is about 360ms/minute, meaning the additional ~20ms we're using here represent an extra 5% overhead. Given the importance of getting Stacks right (or at least not completely wrong), I'd say it's worth it, but it's still noticeable.