ECH: Move nodes off allocator doc updated #1619
Conversation
@kunisen , about the comment you have shared:
The headings are already in a "Q&A" format, but I wasn't sure whether that was the right approach, and I wanted to double check it with other docs folks. I agree that if the headings are kept in this Q&A format, then a "Frequently Asked Questions" heading would make sense, but maybe we should rewrite the headers in a different format. cc: @shainaraskas , what would you say?
Left a few small comments, but other than that LGTM!
this isn't really in our style (reasoning) and could be reworked:

- a couple of them should be removed (e.g. the support CTA), or integrated into the doc ("Could such a system maintenance be avoided or skipped?" should just be introductory information about why this happens and its inevitability)
- some could be pulled into an "Availability during system maintenance" section and perhaps "Data loss risk for non-HA deployments"
- some of them could be reworded ("How can I be notified when a node is changed?" > "Notifications for moved or changed nodes" [more task-based])

I do think that if we want to keep these together, they do need a heading of their own so they're not nested below "Possible causes and impact"
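To make the heading suggestions above concrete, here is one possible layout of the reworked page. The heading names are taken from this comment; the order and heading levels are assumptions, not something agreed in the thread:

```markdown
## Possible causes and impact

Introductory information about why this maintenance happens and why it cannot be avoided or skipped.

## Availability during system maintenance

## Data loss risk for non-HA deployments

## Notifications for moved or changed nodes
```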
Co-authored-by: Stef Nestor <[email protected]>
@shainaraskas : I'll do some rework on this to avoid the FAQ style while keeping all the key points we want to communicate to the users. Thanks a lot for your feedback!
Thanks for being patient and all the help! 🙏

1. I made a bunch of updates based on internal ticket comments - https://github.com/elastic/support-tech-lead/issues/1576#issuecomment-2948156720. Here's the preview:
2. @eedugon I totally get what you and @shainaraskas said above #1619 (comment). Please feel free to make any updates from a docs perspective based on your writing standard. I still added it, but again, please feel free to make your change, even including the removal of that one.
3. Also, I believe it's technically clear now, so we no longer need to discuss anything further internally. But if anything is still technically unclear or about expectations, let's still discuss it internally ha :)
@shainaraskas : I've worked on your suggestions and removed the FAQ style. I'm pretty happy with the outcome and the final sections / sub-sections, let me know your thoughts. I also updated some minor paragraphs and added a couple of introductory sentences that felt needed. The content is 90% similar to the KB article, but I think it reads better and is organized by topic rather than by questions. @kunisen , please share your thoughts too!
Thanks @eedugon looks nice from my side - https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/1619/troubleshoot/monitoring/node-moves-outages#why-data-loss-can-occur-even-with-multiple-zones

Though I am still a bit unfamiliar with this non-FAQ way, let's try it. Some small things:

- WDYT we say "Service availability during node vacate"?
- WDYT we say "Data loss risk without replica shards"?

If you are good with it, then I am good to merge :)
@kunisen , very good suggestions! Next time feel free to add them directly in the code (as suggestions). I've applied the changes, thanks a lot!
Thanks @eedugon indeed I will use the suggestion feature next time. 🙏 @shainaraskas could you kindly help us double check, please? Then I think we should be good to go :)
**What is the impact?**

This document explains the "`Move nodes off of allocator...`" message that appears on the [activity page](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) in {{ech}} deployments, helping you understand its meaning, implications, and what to expect.
suggest splitting this apart so the error message is in its own codeblock and the full text is present. just put [allocatorname] or something as a placeholder
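For illustration, a rough sketch of how that split could look in the doc source; the surrounding sentence is adapted from the line under review, and `[allocator name]` is just a placeholder:

````markdown
This document explains the following message that appears on the [activity page](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) in {{ech}} deployments, helping you understand its meaning, implications, and what to expect:

```text
Move nodes off of allocator [allocator name]
```
````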
During the routine system maintenance, having replicas and multiple availability zones ensures minimal interruption to your service. When nodes are vacated, as long as you have high availability, all search and indexing requests are expected to work within the reduced capacity until the node is back to normal.
I don't think this screenshot adds value if we share the entire error message on the page. it's also very small and hard to read so I'd prefer to skip it
To ensure that your nodes are located on healthy hosts, we vacate nodes to perform routine system maintenance or to remove a host with hardware issues from service.

To ensure that your deployment nodes are located on healthy hosts, we vacate nodes to perform essential system maintenance or to remove a host with hardware issues from service. These tasks cannot be skipped or delayed.
Suggested change:

To ensure that your deployment nodes are located on healthy hosts, we vacate nodes to perform essential system maintenance or to remove a host with hardware issues from service. These tasks cannot be skipped or delayed.

To ensure that your deployment nodes are located on healthy hosts, Elastic vacates nodes to perform essential system maintenance or to remove a host with hardware issues from service. These tasks cannot be skipped or delayed.
All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about updates.
"be notified about updates" is a little repetitive
Suggested change:

All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about updates.

All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about planned maintenance or actions that have been taken to respond to incidents.
1. Follow the first five steps in [Getting notified about deployment health issues](../../deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md).
2. At Step 6, to choose the alert type for when a node is changed, select **CLUSTER HEALTH** → **Nodes changed** → **Edit alert**.

Potential causes of system maintenance include, but not limited to, situations like:
Suggested change:

Potential causes of system maintenance include, but not limited to, situations like:

Potential causes of system maintenance include, but not limited to, situations like the following:
::::{admonition} Availability zones and performance
Increasing the number of zones should not be used to add more resources. The concept of zones is meant for High Availability (2 zones) and Fault Tolerance (3 zones), but neither will work if the cluster relies on the resources from those zones to be operational.

The recommendation is to **scale up the resources within a single zone until the cluster can take the full load (add some buffer to be prepared for a peak of requests)**, then scale out by adding additional zones depending on your requirements: 2 zones for High Availability, 3 zones for Fault Tolerance.
::::
be careful about repeating info that's elsewhere - this concept is something we should probably use a snippet for
you should also avoid bolding and brackets generally. "high availability" and "fault tolerance" also do not need Title Case.
"the recommendation is" is not an ideal sentence structure. Try "You should [blank]"
1. Enable [Stack monitoring](/deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md#enable-logging-and-monitoring-steps) (logs and metrics) on your deployment. Only metrics collection is required for these notifications to work.

In the deployment used as the destination of Stack monitoring:
this needs to be integrated into the list of steps. this should be step 2 and steps 2-4 should be made children
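To make that concrete, a sketch of the nesting being suggested; the wording is taken from the steps visible in this diff, and the ellipsis stands for any intermediate steps not shown here:

```markdown
1. Enable [Stack monitoring](/deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md#enable-logging-and-monitoring-steps) (logs and metrics) on your deployment. Only metrics collection is required for these notifications to work.
2. In the deployment used as the destination of Stack monitoring:
   1. ...
   2. (Optional) Configure an email [connector](kibana://connectors-kibana/email-action-type.md). If you prefer, use the preconfigured `Elastic-Cloud-SMTP` email connector.
   3. Edit the rule **Cluster alerting** > **{{es}} nodes changed** and select the email connector.
```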
If a node’s host experiences an outage, the system automatically vacates the node and displays a related `Don't attempt to gracefully move shards` message on the [**Activity**](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) page. Since the node is unavailable, the system skips checks that ensure the node’s shards have been moved before shutting down the node.

3. (Optional) Configure an email [connector](/deploy-manage/manage-connectors.md). If you prefer, use the pre-configured `Elastic-CLoud-SMTP`.
I'd probably link here instead
Suggested change:

3. (Optional) Configure an email [connector](/deploy-manage/manage-connectors.md). If you prefer, use the pre-configured `Elastic-CLoud-SMTP`.

3. (Optional) Configure an email [connector](kibana://connectors-kibana/email-action-type.md). If you prefer, use the preconfigured `Elastic-Cloud-SMTP` email connector.
Unless overridden or unable, the system will automatically recover the vacated node’s data from replicas or snapshots. If your cluster has high availability, all search and indexing requests should work within the reduced capacity until the node is back to normal.

4. Edit the rule **Cluster alerting** → **{{es}} nodes changed** and select the email connector.
is this only true if you set up the email connector? or is the previous step meant to be "if you haven't already"
Suggested change:

4. Edit the rule **Cluster alerting** → **{{es}} nodes changed** and select the email connector.

4. Edit the rule **Cluster alerting** > **{{es}} nodes changed** and select the email connector.
::::{note}
If you have only one master node in your cluster, during the master node vacate no notification will be sent. Kibana needs to communicate with the master node in order to send a notification. One way to avoid this is by shipping your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.
Suggested change:

If you have only one master node in your cluster, during the master node vacate no notification will be sent. Kibana needs to communicate with the master node in order to send a notification. One way to avoid this is by shipping your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.

If you have only one master node in your cluster, no notification will be sent during the master node vacate. {{kib}} needs to communicate with the master node in order to send a notification. One way to avoid this is by shipping your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.
does this note also depend on having a self-monitoring setup?
As described in #1527, this PR is promoting a knowledge article into our existing doc, per @kunisen and support team request.
Preview:
Changes:
Links to existing KB:
Closes #1527