ECH: Move nodes off allocator doc updated #1619

Open
wants to merge 11 commits into
base: main

Conversation

eedugon
Contributor

@eedugon eedugon commented Jun 5, 2025

As described in #1527, this PR promotes a knowledge base article into our existing doc, at the request of @kunisen and the support team.

Preview:

Changes:

  • Title updated
  • Email notifications section fixed (it wasn't valid).
  • Content of mentioned KB integrated into the doc.

Links to existing KB:

Closes #1527

@eedugon eedugon marked this pull request as ready for review June 5, 2025 10:09
@eedugon eedugon requested review from a team as code owners June 5, 2025 10:09
@kunisen

This comment was marked as outdated.

@eedugon
Contributor Author

eedugon commented Jun 5, 2025

@kunisen, about the comment you have shared:

> But one thing I just noticed, is maybe we could add a "Frequently Asked Questions (FAQs)" sub heading in the page so that readers can understand we included a bunch of FAQs.

The headings are in "Q&A" format style already, but that's something I wasn't sure if it was the right approach, and I wanted to double check that with other docs folks.

I agree that if the headings are kept in this Q&A format, then a "Frequently Asked Questions" heading would make total sense, but maybe we should rewrite the headings in a different format.

cc: @shainaraskas , what would you say?

Contributor

@jakommo jakommo left a comment


Left a few small comments, but other than that LGTM!

@shainaraskas
Collaborator

shainaraskas commented Jun 5, 2025

> The headings are in "Q&A" format style already, but that's something I wasn't sure if it was the right approach, and I wanted to double check that with other docs folks.
>
> I agree that if the headings are kept in this Q&A format, then a "Frequently Asked Questions" heading would make total sense, but maybe we should rewrite the headings in a different format.

this isn't really in our style (reasoning) and could be reworked

a couple of them should be removed (e.g. the support CTA), or integrated into the doc ("Could such a system maintenance be avoided or skipped?" should just be introductory information about why this happens and its inevitability)

some could be pulled into an "Availability during system maintenance" section and perhaps "Data loss risk for non-HA deployments"

some of them could be reworded ("How can I be notified when a node is changed?" > "Notifications for moved or changed nodes" [more task-based]).

I do think that if we want to keep these together, they do need a heading of their own so they're not nested below "Possible causes and impact"

@eedugon
Contributor Author

eedugon commented Jun 5, 2025

@shainaraskas: I'll rework this to avoid the FAQ style while keeping all the key points we want to communicate to users. Thanks a lot for your feedback!

@kunisen
Contributor

kunisen commented Jun 6, 2025

Thanks for being patient and all the help! 🙏

[1]

I made a bunch of updates based on internal ticket comments - https://github.com/elastic/support-tech-lead/issues/1576#issuecomment-2948156720.

Here's the preview:
https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/1619/troubleshoot/monitoring/node-moves-outages

[2]

@eedugon I totally get what you and @shainaraskas said above #1619 (comment). Please feel free to make any updates from docs perspective based on your writing standard.

I still added an FAQ heading, because without it the page is logically unbalanced and not ready to be merged.
I know it doesn't meet the docs criteria, but in case reorganizing the wording takes a long time or a lot of effort, the page is technically and logically ready to merge with it, which means we could merge first and improve the wording afterward.

Again, please feel free to make your change even including the removal of that one.

[3]

Also, I believe it's technically clear now, so we no longer need to discuss anything further internally. But if anything is still unclear, technically or in terms of expectations, let's discuss it internally :)

@eedugon
Contributor Author

eedugon commented Jun 10, 2025

@shainaraskas: I've worked on your suggestions and removed the FAQ style. I'm pretty happy with the outcome and the final sections and subsections. Let me know your thoughts.

I also updated some minor paragraphs and added a couple of introductory sentences that felt needed (mainly in performance considerations during system maintenance).

The content is 90% similar to the KB article but I think it reads better and it's organized by topic more than by questions.

@kunisen, please share your thoughts too!

@eedugon eedugon requested a review from shainaraskas June 10, 2025 09:44
@kunisen
Contributor

kunisen commented Jun 10, 2025

Thanks @eedugon, looks nice from my side - https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/1619/troubleshoot/monitoring/node-moves-outages#why-data-loss-can-occur-even-with-multiple-zones. I am still a bit unfamiliar with this non-FAQ style, but let's try it.

Some small things:

> Availability during node vacate

WDYT we say "Service availability during node vacate"?

> Why data loss can occur even with multiple zones

WDYT we say "Data loss risk without replica shards"?


If you are good with it, then I am good to merge :)

@eedugon
Contributor Author

eedugon commented Jun 10, 2025

@kunisen, very good suggestions. Next time, feel free to add them directly in the code (as suggestions) and we can discuss them there.

I've applied the changes, thanks a lot!

@kunisen
Contributor

kunisen commented Jun 11, 2025

Thanks @eedugon, indeed I will use suggestions next time. 🙏

@shainaraskas could you kindly help us double check whether we are good to go, please?
Once we merge the public doc PR, I will tweak our KB a little to align it with the public doc.

Then I think we should be good to go :)


**What is the impact?**
This document explains the "`Move nodes off of allocator...`" message that appears on the [activity page](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) in {{ech}} deployments, helping you understand its meaning, implications, and what to expect.
Collaborator

suggest splitting this apart so the error message is in its own codeblock and the full text is present. just put [allocatorname] or something as a placeholder


During the routine system maintenance, having replicas and multiple availability zones ensures minimal interruption to your service. When nodes are vacated, as long as you have high availability, all search and indexing requests are expected to work within the reduced capacity until the node is back to normal.
![Move nodes off allocator](images/move_nodes_ech_allocator.jpeg)
Collaborator

I don't think this screenshot adds value if we share the entire error message on the page. it's also very small and hard to read so I'd prefer to skip it


To ensure that your nodes are located on healthy hosts, we vacate nodes to perform routine system maintenance or to remove a host with hardware issues from service.
To ensure that your deployment nodes are located on healthy hosts, we vacate nodes to perform essential system maintenance or to remove a host with hardware issues from service. These tasks cannot be skipped or delayed.
Collaborator

Suggested change
To ensure that your deployment nodes are located on healthy hosts, we vacate nodes to perform essential system maintenance or to remove a host with hardware issues from service. These tasks cannot be skipped or delayed.
To ensure that your deployment nodes are located on healthy hosts, Elastic vacates nodes to perform essential system maintenance or to remove a host with hardware issues from service. These tasks cannot be skipped or delayed.


All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about updates.
Collaborator

"be notified about updates" is a little repetitive

Suggested change
All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about updates.
All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about planned maintenance or actions that have been taken to respond to incidents.


1. Follow the first five steps in [Getting notified about deployment health issues](../../deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md).
2. At Step 6, to choose the alert type for when a node is changed, select **CLUSTER HEALTH** → **Nodes changed** → **Edit alert**.
Potential causes of system maintenance include, but not limited to, situations like:
Collaborator

Suggested change
Potential causes of system maintenance include, but not limited to, situations like:
Potential causes of system maintenance include, but not limited to, situations like the following:

Comment on lines +75 to 79
::::{admonition} Availability zones and performance
Increasing the number of zones should not be used to add more resources. The concept of zones is meant for High Availability (2 zones) and Fault Tolerance (3 zones), but neither will work if the cluster relies on the resources from those zones to be operational.

The recommendation is to **scale up the resources within a single zone until the cluster can take the full load (add some buffer to be prepared for a peak of requests)**, then scale out by adding additional zones depending on your requirements: 2 zones for High Availability, 3 zones for Fault Tolerance.
::::
Collaborator

be careful about repeating info that's elsewhere - this concept is something we should probably use a snippet for

you should also avoid bolding and brackets generally. "high availability" and "fault tolerance" also do not need Title Case.

"the recommendation is" is not an ideal sentence structure. Try "You should [blank]"
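
As a rough illustration of the sizing rule in the admonition quoted above (not part of the PR itself), the following sketch checks whether a single zone could serve the full load on its own before more zones are added. The function names, the unit-less capacity numbers, and the 20% default buffer are hypothetical choices made for this example. They are not Elastic recommendations or an Elastic API.

```python
def zone_can_carry_full_load(zone_capacity: float,
                             peak_load: float,
                             buffer: float = 0.2) -> bool:
    """Return True if one zone alone can serve the peak load plus a buffer.

    All values are in the same arbitrary unit (for example, GB of RAM).
    The 20% default buffer is an assumption made for this sketch.
    """
    if zone_capacity < 0 or peak_load < 0 or buffer < 0:
        raise ValueError("capacity, load, and buffer must be non-negative")
    return zone_capacity >= peak_load * (1.0 + buffer)


def zones_for(requirement: str) -> int:
    """Map a requirement to a zone count, as stated in the admonition."""
    zones = {"high availability": 2, "fault tolerance": 3}
    return zones[requirement]


# Scale up within one zone first, then scale out by adding zones.
if zone_can_carry_full_load(zone_capacity=64, peak_load=50):
    print("Add zones:", zones_for("high availability"))
else:
    print("Scale up the single zone before adding zones")
```

The point of checking a single zone first is that the other zones then act purely as redundancy: if one zone is vacated, the remaining capacity is still enough for the full load.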


1. Enable [Stack monitoring](/deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md#enable-logging-and-monitoring-steps) (logs and metrics) on your deployment. Only metrics collection is required for these notifications to work.

In the deployment used as the destination of Stack monitoring:
Collaborator

this needs to be integrated into the list of steps. this should be step 2 and steps 2-4 should be made children


If a node’s host experiences an outage, the system automatically vacates the node and displays a related `Don't attempt to gracefully move shards` message on the [**Activity**](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) page. Since the node is unavailable, the system skips checks that ensure the node’s shards have been moved before shutting down the node.
3. (Optional) Configure an email [connector](/deploy-manage/manage-connectors.md). If you prefer, use the pre-configured `Elastic-CLoud-SMTP`.
Collaborator

I'd probably link here instead

Suggested change
3. (Optional) Configure an email [connector](/deploy-manage/manage-connectors.md). If you prefer, use the pre-configured `Elastic-CLoud-SMTP`.
3. (Optional) Configure an email [connector](kibana://connectors-kibana/email-action-type.md). If you prefer, use the preconfigured `Elastic-Cloud-SMTP` email connector.


Unless overridden or unable, the system will automatically recover the vacated node’s data from replicas or snapshots. If your cluster has high availability, all search and indexing requests should work within the reduced capacity until the node is back to normal.
4. Edit the rule **Cluster alerting** → **{{es}} nodes changed** and select the email connector.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only true if you set up the email connector? or is the previous step meant to be "if you haven't already"

Suggested change
4. Edit the rule **Cluster alerting** → **{{es}} nodes changed** and select the email connector.
4. Edit the rule **Cluster alerting** > **{{es}} nodes changed** and select the email connector.

4. Edit the rule **Cluster alerting** → **{{es}} nodes changed** and select the email connector.

::::{note}
If you have only one master node in your cluster, during the master node vacate no notification will be sent. Kibana needs to communicate with the master node in order to send a notification. One way to avoid this is by shipping your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.
Collaborator

Suggested change
If you have only one master node in your cluster, during the master node vacate no notification will be sent. Kibana needs to communicate with the master node in order to send a notification. One way to avoid this is by shipping your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.
If you have only one master node in your cluster, no notification will be sent during the master node vacate. {{kib}} needs to communicate with the master node in order to send a notification. One way to avoid this is by shipping your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.

Collaborator

does this note also depend on having a self-monitoring setup?

Development

Successfully merging this pull request may close these issues.

ECH - Promote KB about allocator moving nodes due to essential system maintenance
5 participants