EmergencyReparentShard: support reachable replica tablets w/mysqld down
#18896
base: main
Conversation
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
 var args []string
 if timeout != "" {
-	args = append(args, "--action_timeout", timeout)
+	args = append(args, "--action-timeout", timeout)
This resolves a deprecation warning caused by the underscore-style flag name.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #18896      +/-   ##
==========================================
- Coverage   69.77%   69.77%   -0.01%
==========================================
  Files        1608     1608
  Lines      214908   214938      +30
==========================================
+ Hits       149953   149967      +14
- Misses      64955    64971      +16

☔ View full report in Codecov by Sentry.
Changed the title: "EmergencyReparentShard: support reachable replicas w/mysqld down" → "EmergencyReparentShard: support reachable replica tablets w/mysqld down"
mattlord left a comment:
LGTM! Just some minor comments. Can you please add another test case to the Test_stopReplicationAndBuildStatusMaps unit test? Unless there's not really a good way to do that?
// we prioritize completing the reparent (availability) for the common case. If this edge case were
// to occur, errant GTID(s) will be produced; if this happens often we should return UNAVAILABLE
// from vttablet using more detailed criteria (check the pidfile + running PID, etc).
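To make the "pidfile + running PID" idea in that comment concrete, here is a minimal sketch. This is not code from this PR; the pidfile path, helper name, and exact semantics are assumptions for illustration only.

```go
package main

// Sketch only: treats mysqld as "truly down" when its pidfile is missing or
// points at a PID that no longer exists. Real criteria in vttablet would need
// to handle stale pidfiles, restarts in progress, permissions, etc.

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

func mysqldLooksDown(pidfile string) (bool, error) {
	data, err := os.ReadFile(pidfile)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			// No pidfile: mysqld likely never started or shut down cleanly.
			return true, nil
		}
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return false, fmt.Errorf("malformed pidfile %q: %w", pidfile, err)
	}
	// Signal 0 checks process existence without delivering a signal;
	// ESRCH means there is no such process anymore.
	if err := syscall.Kill(pid, 0); err != nil {
		return errors.Is(err, syscall.ESRCH), nil
	}
	return false, nil
}

func main() {
	down, err := mysqldLooksDown("/var/run/mysqld/mysqld.pid") // path is an assumption
	fmt.Println(down, err)
}
```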
I wonder if it's not worth improving this case today? The lack of detail may have been in place simply because it did not impact any operations. But now we're building logic around the meaning we infer from the response. I'd say we should do this now, provided you have an idea how to do it.
What do we get in the error? I wonder if it's not an SQLError we can extract that maps to one of these: https://dev.mysql.com/doc/refman/en/gone-away.html
You can see elsewhere in the code base where we check whether an error contains an SQL error (a MySQL error) and, if so, whether its code matches one or more MySQL error codes. Actually... just below this here 😆
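For illustration, a minimal sketch of that kind of check, assuming the wrapped error string carries a MySQL errno (the regexp extraction and helper name are stand-ins, not Vitess's actual sqlerror API; the error codes are the documented MySQL client "can't connect / gone away" family):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Standard MySQL client error codes for "server unreachable" situations,
// per https://dev.mysql.com/doc/refman/en/gone-away.html and related pages.
const (
	crConnectionError = 2002 // CR_CONNECTION_ERROR
	crConnHostError   = 2003 // CR_CONN_HOST_ERROR
	crServerGone      = 2006 // CR_SERVER_GONE_ERROR
	crServerLost      = 2013 // CR_SERVER_LOST
)

// errnoRe pulls an "(errno NNNN)" fragment out of a wrapped error string.
// The "(errno ...)" format is an assumption for this sketch.
var errnoRe = regexp.MustCompile(`\(errno (\d+)\)`)

// mysqldUnreachable reports whether err looks like "mysqld is down/unreachable"
// rather than some other kind of failure.
func mysqldUnreachable(err error) bool {
	if err == nil {
		return false
	}
	m := errnoRe.FindStringSubmatch(err.Error())
	if m == nil {
		return false
	}
	code, _ := strconv.Atoi(m[1])
	switch code {
	case crConnectionError, crConnHostError, crServerGone, crServerLost:
		return true
	}
	return false
}

func main() {
	err := fmt.Errorf("net.Dial(/path/to/mysql.sock) failed (errno 2002) (sqlstate HY000)")
	fmt.Println(mysqldUnreachable(err)) // true
}
```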
arthurschreiber left a comment:

So, I'm trying to wrap my head around what this change effectively means. What if the replica that's down is the most advanced replica? You kinda hinted at it above, but does that mean we have a potential for data loss (i.e. changes that were acked by the replica that's now down and have not been replicated anywhere else might actually be lost)? The current behavior is to fail the ERS.
@arthurschreiber correct, unfortunately. My understanding is this is the same tradeoff that was made in old-Orchestrator, cc @shlomi-noach to confirm (if it's still in cache).

What is being seen in the wild is that the current behaviour prevents VTOrc (which uses ERS for many things) from remediating shards in partially-unhealthy states. So basically you end up with an indefinitely broken shard, due to a replica that is often unimportant. One could page a human here (we don't automate/document this) but I don't think that is the state of the art/elegant, so to speak.

I'm glad you raise the point about manual intervention: erroring for a human to respond is viable on the …
I've been thinking about this some more, and I think there's some fundamental issues with …

Without semi-sync enabled (so, durability policy set to `none`), …

With a durability policy set, the requirements tighten a lot. I think we can combine the two points above in a generalized fashion. When performing …

One complication here is that we can't rely on the value of the durability policy at the time of failover, as the value could have been changed (e.g. from …).
@arthurschreiber this is a good point. While it can't be 100% certain, this PR is relying on …

The codes that cause …

For context, these error codes … While imperfect, the gaps where this isn't a good signal seem pretty small here. I'm curious in what scenario we would get these error codes above from MySQL, but semi-sync is still running? Without digging I would say (ignoring remote-tablets again) just …

One of the first steps the reparenting code does is fetch the running durability policy from the topo. From there it assumes that is the correct policy. That's "usually" right, but again imperfect. The edge case I can see where this is not true is the keyspace durability policy being changed seconds before/during a reparent. Are there any other cases? I think that's out of scope of this PR but a very good point to address. If it's just the single scenario I mention, the risk of this occurring only exists for a handful of seconds, assuming VTOrc is successfully fixing things.

Yes, this would likely be the most accurate approach, and one I began RFC'ing at Slack for similar reasons - using the tablet record for this state. An overall problem with ERS is that it is used by both stateless VTCtlds and kind-of-stateful VTOrcs. VTOrc technically could store the state you refer to in its backend database, but VTCtld has no such backend database. So we're left with just the topo as something both can use. The topo is probably OK, but it has its inefficiencies at scale and this would add calls and more reliance on the topo being up - that said, reparents already rely heavily on the topo.

But again, I see the few seconds where the durability policy may not be 100% accurate as kind of out of scope, or at least this PR doesn't make that better or worse. What do you think?
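As a side note for readers, the timing window being discussed can be seen in a condensed sketch of the flow (the types and function names below are stand-ins, not the actual reparentutil/topo APIs): the durability policy is read from the topo once and then trusted for the rest of the operation.

```go
package main

import (
	"context"
	"fmt"
)

// Minimal stand-ins so the sketch compiles; the real code goes through the
// topo server and reparentutil instead.
type keyspaceRecord struct{ DurabilityPolicy string }

type topoLike interface {
	GetKeyspace(ctx context.Context, name string) (*keyspaceRecord, error)
}

type fakeTopo struct{}

func (fakeTopo) GetKeyspace(ctx context.Context, name string) (*keyspaceRecord, error) {
	return &keyspaceRecord{DurabilityPolicy: "semi_sync"}, nil
}

// emergencyReparentSketch illustrates why the durability policy is only as
// fresh as the single read at step 1: nothing later re-checks it.
func emergencyReparentSketch(ctx context.Context, ts topoLike, keyspace string) error {
	// Step 1: one read of the keyspace record from the topo.
	ks, err := ts.GetKeyspace(ctx, keyspace)
	if err != nil {
		return err
	}
	policy := ks.DurabilityPolicy // e.g. "none" or "semi_sync"

	// Steps 2+: stop replication, pick a candidate, promote - all of it trusts
	// `policy`. If an operator changes the keyspace durability policy while
	// this runs, the reparent keeps using the stale value; that is the
	// few-seconds window discussed in this thread.
	fmt.Printf("reparenting %s assuming durability policy %q\n", keyspace, policy)
	return nil
}

func main() {
	_ = emergencyReparentSketch(context.Background(), fakeTopo{}, "commerce")
}
```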
This reverts commit 19996f8.
Description
A follow-up to #18565 (the 2nd half of the change), this PR provides `EmergencyReparentShard` the required context to skip candidate replicas that have `mysqld` crashed/down although `vttablet` remains up.

Today ANY candidate tablet of a shard having `mysqld` crashed/down at the time of `EmergencyReparentShard` will break the reparent. 😱

This problem affects both VTCtld and VTOrc ERS operations. Let's make sure `EmergencyReparentShard` works in this kind of emergency!

This change raises the question: what if ALL candidates have `vttablet` up and `mysqld` down? The existing logic that checks we found enough valid candidates catches this scenario the same as now - we just don't break the entire ERS operation on single failures.

Another question this raises: what happens to the tablet with MySQL down? This PR doesn't really address/change that. A replica in this state would look unhealthy to VTOrc, but it has no way to fix it. This broken tablet should probably get replaced (Kube or other automation) - on the Kube operator this is built-in.
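A condensed sketch of the behavior described above, with invented helper names (the real change lives in the reparent's stop-replication/status-map handling): a per-tablet failure that looks like "mysqld down" removes that tablet from candidacy, and only the pre-existing "not enough valid candidates" check can still fail the whole ERS.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaStatus stands in for the per-tablet result the reparent collects.
type replicaStatus struct {
	alias string
	err   error // error from asking the tablet to stop replication / report status
}

// errMysqldDown stands in for "vttablet answered, but mysqld is unreachable".
var errMysqldDown = errors.New("mysqld unreachable (errno 2006)")

// selectCandidates skips individual mysqld-down replicas instead of aborting;
// other error kinds are handled with more nuance in the real code, simplified
// here to a hard failure for brevity.
func selectCandidates(statuses []replicaStatus) ([]string, error) {
	var candidates []string
	for _, s := range statuses {
		if s.err != nil {
			if errors.Is(s.err, errMysqldDown) {
				// Previously: any such error broke the whole ERS.
				// Now: drop this tablet as a promotion candidate and move on.
				continue
			}
			return nil, fmt.Errorf("tablet %s: %w", s.alias, s.err)
		}
		candidates = append(candidates, s.alias)
	}
	if len(candidates) == 0 {
		// The existing "did we find enough valid candidates" check still
		// catches the ALL-candidates-down scenario.
		return nil, errors.New("no valid candidates for emergency reparent")
	}
	return candidates, nil
}

func main() {
	got, err := selectCandidates([]replicaStatus{
		{alias: "zone1-101"},
		{alias: "zone1-102", err: errMysqldDown}, // reachable vttablet, dead mysqld
	})
	fmt.Println(got, err) // [zone1-101] <nil>
}
```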
Related Issue(s)
`EmergencyReparentShard` fails when `mysqld` is down on any tablet in a shard #18528

Checklist
Deployment Notes
AI Disclosure