
Conversation

@timvaillancourt
Contributor

@timvaillancourt timvaillancourt commented Nov 11, 2025

Description

A follow-up to #18565 (the 2nd half of the change), this PR provides EmergencyReparentShard with the required context to skip candidate replicas that have mysqld crashed/down even though vttablet remains up

Today, ANY candidate tablet of a shard that has mysqld crashed/down at the time of EmergencyReparentShard will break the reparent with:

rpc error: code = Unknown desc = failed to stop replication and build status maps: error when getting replication status for alias zone1-0000000102: Code: UNAVAILABLE
rpc error: code = Unavailable desc = TabletManager.StopReplicationAndGetStatus on zone1-0000000102: before status failed: net.Dial(/Users/tim/github/timvaillancourt/vitess/vtdataroot/vtroot_8301/vt_0000000102/mysql.sock) to local server failed: dial unix /Users/tim/github/timvaillancourt/vitess/vtdataroot/vtroot_8301/vt_0000000102/mysql.sock: connect: no such file or directory (errno 2002) (sqlstate HY000)

😱

This problem affects both VTCtld and VTOrc ERS operations. Let's make sure EmergencyReparentShard works in this kind of emergency!

This change raises the question: what if ALL candidates have vttablet up but mysqld down? The existing check that we found enough valid candidates catches this scenario just as it does today - we simply no longer break the entire ERS operation on single failures
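For illustration only, a minimal sketch of the shape of that behaviour (shouldSkipTablet is a hypothetical helper name, not the actual code in go/vt/vtctl/reparentutil/replication.go):

// Hypothetical sketch, not the actual diff in this PR: decide whether a
// per-tablet StopReplicationAndGetStatus error should abort the reparent or
// merely exclude that tablet from the candidate set.
package example

import (
	vtrpcpb "vitess.io/vitess/go/vt/proto/vtrpc"
	"vitess.io/vitess/go/vt/vterrors"
)

// shouldSkipTablet returns true when the error indicates vttablet answered
// but mysqld is unreachable (surfaced as UNAVAILABLE); such tablets are
// excluded as candidates rather than failing the entire ERS.
func shouldSkipTablet(err error) bool {
	return err != nil && vterrors.Code(err) == vtrpcpb.Code_UNAVAILABLE
}

Any other error code still fails the operation, and the existing "did we find enough valid candidates" check covers the case where every replica ends up skipped.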

Another question this raises: what happens to the tablet with MySQL down? This PR doesn't really address/change that. A replica in this state would look unhealthy to VTOrc, but VTOrc has no way to fix it. This broken tablet should probably be replaced (by Kube or other automation) - on the Kube operator this is built-in

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

@timvaillancourt timvaillancourt added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: vtctl Component: VTTablet labels Nov 11, 2025
@vitess-bot
Contributor

vitess-bot bot commented Nov 11, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator.
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Nov 11, 2025
@github-actions github-actions bot added this to the v24.0.0 milestone Nov 11, 2025
@timvaillancourt timvaillancourt removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Nov 11, 2025
var args []string
if timeout != "" {
-	args = append(args, "--action_timeout", timeout)
+	args = append(args, "--action-timeout", timeout)
Contributor Author


This resolves a deprecation warning caused by the underscore in the flag name

@codecov

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.77%. Comparing base (1f49de4) to head (034d040).
⚠️ Report is 7 commits behind head on main.

Files with missing lines | Patch % | Lines
go/vt/vtctl/reparentutil/replication.go | 50.00% | 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18896      +/-   ##
==========================================
- Coverage   69.77%   69.77%   -0.01%     
==========================================
  Files        1608     1608              
  Lines      214908   214938      +30     
==========================================
+ Hits       149953   149967      +14     
- Misses      64955    64971      +16     


@timvaillancourt timvaillancourt added the Component: VTorc Vitess Orchestrator integration label Nov 12, 2025
@timvaillancourt timvaillancourt changed the title EmergencyReparentShard: support reachable replicas w/mysqld down EmergencyReparentShard: support reachable replica tablets w/mysqld down Nov 12, 2025
@timvaillancourt timvaillancourt marked this pull request as ready for review November 12, 2025 17:14
Member

@mattlord mattlord left a comment


LGTM! Just some minor comments. Can you please add another test case to the Test_stopReplicationAndBuildStatusMaps unit test? Unless there's not really a good way to do that?

Comment on lines +259 to +261
// we prioritize completing the reparent (availability) for the common case. If this edge case were
// to occur, errant GTID(s) will be produced; if this happens often we should return UNAVAILABLE
// from vttablet using more detailed criteria (check the pidfile + running PID, etc).


I wonder if it's not worth improving this case today? The lack of detail may have been in place simply because it did not impact any operations. But now we're building logic around the meaning we infer from the response. I'd say we should do this now, provided you have an idea how to do it.
What do we get in the error? I wonder if it's not an SQLError we can extract that maps to one of these: https://dev.mysql.com/doc/refman/en/gone-away.html
You can see elsewhere in the code base where we look to see if the error contains an SQL error (a MySQL error, and if so, if the code matches 1 or more MySQL error codes). Actually... just below this here 😆
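For the sake of discussion, a rough sketch of the extraction step being suggested here; the helper name and the errors.As step are assumptions about how the error arrives, not existing Vitess code:

// Rough sketch (assumptions noted): recover the MySQL error number from an
// error so callers can branch on specific codes (e.g. only 2002/2006) rather
// than on the coarse gRPC code alone.
package example

import (
	"errors"

	"vitess.io/vitess/go/mysql/sqlerror"
)

// mysqlErrorNumber returns the MySQL error number carried by err, if any.
func mysqlErrorNumber(err error) (sqlerror.ErrorCode, bool) {
	var sqlErr *sqlerror.SQLError
	if errors.As(err, &sqlErr) {
		return sqlErr.Number(), true
	}
	// Across the tabletmanager RPC boundary the typed error is usually
	// flattened into a string, so a real implementation would likely need to
	// parse the number back out of the error message instead.
	return 0, false
}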

@arthurschreiber
Member

arthurschreiber commented Nov 17, 2025

So, I'm trying to wrap my head around what this change effectively means.

What if the replica that's down is the most advanced replica? You kinda hinted at it above, but does that mean we have a potential for data loss (i.e. changes that were acked by the replica that's now down and have not been replicated anywhere else might actually be lost)?

The current behavior is to fail ERS if we couldn't guarantee that all replicas can be stopped successfully (which implies that all replicas need to be healthy to perform an ERS). If we can't guarantee that, manual intervention is required because we don't want to risk losing any data and it's impossible to say what the correct way forward is going to be. But if I understand correctly, this change now means that we'll force an ERS to happen anyway, potentially promoting a replica that didn't have the latest changes. 🤔

@timvaillancourt
Contributor Author

timvaillancourt commented Nov 17, 2025

The current behavior is to fail ERS if we couldn't guarantee that all replicas can be stopped successfully (which implies that all replicas need to be healthy to perform an ERS). If we can't guarantee that, manual intervention is required because we don't want to risk losing any data and it's impossible to say what the correct way forward is going to be. But if I understand correctly, this change now means that we'll force an ERS to happen anyway, potentially promoting a replica that didn't have the latest changes.

@arthurschreiber correct, unfortunately. My understanding is this is the same tradeoff that was made in old-Orchestrator, cc @shlomi-noach to confirm (if it's still in cache).

What is being seen in the wild is that the current behaviour prevents VTOrc (which uses ERS for many things) from remediating shards in partially-unhealthy states. So basically you end up with an indefinitely broken shard, due to a replica that is often unimportant. One could page a human here (we don't automate/document this), but I don't think that is state of the art/elegant, so to speak

I'm glad you raise the point about manual intervention: erroring for a human to respond is viable on the vtctldclient EmergencyReparentShard command if we want to assume the caller of CLIs is a human that can "respond". I've seen cases where vtctldclient commands (or their RPCs) are used in non-human automation, however, so it's a hard call to make. But I guess what I'm saying is we could make this opt-in in specific areas, perhaps just VTOrc 🤔

@arthurschreiber
Member

I've been thinking about this some more, and I think there are some fundamental issues with EmergencyReparentShard, especially when run with semi-sync.

Without semi-sync enabled (so, durability policy set to none), I think the change you are proposing here makes sense, but we could go even further and allow EmergencyReparentShard to complete, no matter whether the old primary is actually reachable, and no matter the state of the replicas - as long as there's one replica that can be promoted.

With a durability policy set, the requirements tighten a lot. EmergencyReparentShard is not allowed to promote a replica to a primary, unless we can guarantee that all replicas that were ack-ing transactions have stopped replicating from the primary. Otherwise performing a failover runs the risk of introducing a split brain, which is what setting the durability policy to cross_cell is supposed to prevent from happening.

I think we can combine the two points above in a generalized fashion. When performing EmergencyReparentShard, Vitess needs to guarantee that all semi-sync enabled replicas have stopped replicating from the primary we're failing away from. For asynchronous replicas, we don't care at all about their state, and their availability should not have any impact on whether EmergencyReparentShard can be performed (or the success of that operation).

One complication here is that we can't rely on the value of the durability policy at the time of failover, as the value could have been changed (e.g. from none to cross_cell or the other way around), but it's only reflected in the cluster after PlannedReparentShard or EmergencyReparentShard has completed. And we unfortunately can't ask which mode a replica is running in during EmergencyReparentShard because the replicas could be unavailable. I think to model this cleanly, we need to store this information in the tablet record (e.g. before starting replication, we store whether a tablet is running with semi-sync enabled or not). With this information in the topo, EmergencyReparentShard can correctly determine the set of replicas that need to be available to perform an EmergencyReparentShard operation successfully, and which replicas aren't important for the success of the failover.
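To make that concrete, a purely hypothetical sketch; SemiSyncAcker does not exist on the real tablet record today, it just stands in for whatever persisted marker would record "replication on this tablet was started with semi-sync acking enabled":

// Purely hypothetical sketch of the proposal above; replicaRecord and
// SemiSyncAcker are illustrative stand-ins, not real Vitess types or fields.
package example

// replicaRecord stands in for a tablet record extended with a persisted
// semi-sync marker, written before replication is started on the tablet.
type replicaRecord struct {
	Alias         string
	SemiSyncAcker bool // hypothetical: true if this replica acks semi-sync
}

// requiredForERS returns the replicas that must be reachable (and stopped)
// before a new primary may be promoted: under the proposal, only unreachable
// semi-sync ackers leave a split-brain risk; async replicas can be skipped
// without affecting the correctness of the failover.
func requiredForERS(replicas []*replicaRecord) []*replicaRecord {
	var required []*replicaRecord
	for _, r := range replicas {
		if r.SemiSyncAcker {
			required = append(required, r)
		}
	}
	return required
}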

@timvaillancourt
Contributor Author

timvaillancourt commented Dec 2, 2025

With a durability policy set, the requirements tighten a lot. EmergencyReparentShard is not allowed to promote a replica to a primary, unless we can guarantee that all replicas that were ack-ing transactions have stopped replicating from the primary.

@arthurschreiber this is a good point. While it can't be 100% certain, this PR is relying on vtrpcpb.Code_UNAVAILABLE from the StopReplicationAndGetStatus RPC to indicate MySQL is not running, which implies (imperfectly) that semi-sync is not running

UNAVAILABLE is returned for MySQL error codes that are pretty strong signals that MySQL is completely down, but to be fair, some could be a false signal. Ignoring "remote tablets", I think this will usually be accurate, but it would be good to understand in what scenarios that is not true

The codes that cause UNAVAILABLE:

  • CRServerGone = ErrorCode(2006) - MySQL server has gone away 🟡
    • Potentially less reliable; this doesn't necessarily mean semi-sync is stopped, but right now we'd assume it is
  • ERServerShutdown = ErrorCode(1053) - Server shutdown in progress 🟡
    • Kind of likely semi-sync is stopped, but not 100%
    • Typical cause: MySQL is trying to kill running queries during shutdown. This is a risk if we have massive queries that take time to kill
    • Open question: at what stage does MySQL stop replication? Does 1053 mean replication is stopped? ❓
  • ERServerIsntAvailable = ErrorCode(3168) - Server isn't available 🟢
    • This error is returned when the MySQL port is up but the "server API" has not started
      • I find no examples of people reporting this code on Google. It makes sense because this scenario must exist for a very short window of time
      • Open question: does replication start before OR after the "server API"? ❓
  • CRConnectionError = ErrorCode(2002) - Failed to connect to the local MySQL server 🟢
    • This is the code Slack got reliably when mysqld is 100% down, never in-between
  • CRConnHostError = ErrorCode(2003) - Can't connect to MySQL server on '%s:%u' (%d) 🟡
    • This is the same as 2002, but for a hostname/port vs socket file
    • Unless a tablet is misconfigured this would mean "remote tablets". Not the typical case. Perhaps it should be ignored

For context, these error-code-to-vtrpcpb.Code mappings have been in queryserver for a long time. For tabletmanager, it has matched this mapping since v23.

While imperfect, the gaps where this isn't a good signal seem pretty small. I'm curious in what scenario we would get the error codes above from MySQL while semi-sync is still running? Without digging I would say (ignoring remote tablets again) just CRServerGone/2006 🤔
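As a reference point for this discussion, a paraphrased sketch of the mapping described in the list above (the real mapping lives in the queryserver/tabletmanager error-handling code and may differ in detail; codeForMySQLError is an illustrative name):

// Paraphrased sketch of the MySQL-error-number to gRPC-code mapping discussed
// above, not the actual Vitess function.
package example

import (
	"vitess.io/vitess/go/mysql/sqlerror"
	vtrpcpb "vitess.io/vitess/go/vt/proto/vtrpc"
)

// codeForMySQLError maps the "mysqld is down / shutting down" error numbers
// listed above to UNAVAILABLE; everything else falls back to UNKNOWN here.
func codeForMySQLError(num sqlerror.ErrorCode) vtrpcpb.Code {
	switch num {
	case sqlerror.CRServerGone, // 2006: MySQL server has gone away
		sqlerror.ERServerShutdown, // 1053: server shutdown in progress
		sqlerror.ERServerIsntAvailable, // 3168: server API not yet started
		sqlerror.CRConnectionError, // 2002: can't connect via socket
		sqlerror.CRConnHostError: // 2003: can't connect via host/port
		return vtrpcpb.Code_UNAVAILABLE
	}
	return vtrpcpb.Code_UNKNOWN
}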

One complication here is that we can't rely on the value of the durability policy at the time of failover, as the value could have been changed (e.g. from none to cross_cell or the other way around), but it's only reflected in the cluster after PlannedReparentShard or EmergencyReparentShard has completed.

One of the first steps the reparenting code takes is to fetch the running durability policy from the topo. From there it assumes that is the correct policy. That's "usually" right, but again imperfect

The edge case where I can see this not being true is the keyspace durability policy being changed seconds before/during a reparent. Are there any other cases? I think that's out of scope for this PR, but it's a very good point to address. If it's just the single scenario I mention, the risk of this occurring only exists for a handful of seconds, assuming VTOrc is successfully fixing things

I think to model this cleanly, we need to store this information in the tablet record (e.g. before starting replication, we store whether a tablet is running with semi-sync enabled or not). With this information in the topo, EmergencyReparentShard can correctly determine the set of replicas that need to be available to perform an EmergencyReparentShard operation successfully, and which replicas aren't important for the success of the failover.

Yes, this would likely be the most accurate approach, and one I began RFC'ing at Slack for similar reasons - using the tablet record for this state. An overall problem with ERS is that it is used by both stateless VTCtlds and kind-of-stateful VTOrcs. VTOrc technically could store the state you refer to in its backend database, but VTCtld has no such backend database. So we're left with just the topo as something both can use. The topo is probably ok, but it has its inefficiencies at scale, and this would add calls and more reliance on the topo being up - that said, reparents already rely heavily on the topo

But again, I see the few seconds where the durability policy may not be 100% accurate as kind of out of scope, or at least this PR doesn't make that better or worse. What do you think?


Labels

Component: vtctl Component: VTorc Vitess Orchestrator integration Component: VTTablet Type: Enhancement Logical improvement (somewhere between a bug and feature)


Development

Successfully merging this pull request may close these issues.

Bug Report/RFC: EmergencyReparentShard fails when mysqld is down on any tablet in a shard

3 participants