
Segmentation and other faults on GKE Autopilot clusters #5842

Open
@chipkent

Description


I have been running emissary-ingress for a few years. Something has changed recently, either in emissary-ingress or in GKE, that has broken Emissary. It now fails to come up and gets stuck in a loop of crashing and retrying.

My entire setup is Terraform, so it is extremely repeatable.

The observed behavior started a few days ago, some time after 2025-05-20. Possibly GKE updated the Autopilot cluster's Kubernetes version under me. The Terraform below upgrades me to the latest GKE Kubernetes version; the problem is present on at least this version and the prior one.

v8.9.1 had similar problems. It additionally threw an exception after attempting to allocate 103 TB of RAM (or some similarly outrageous amount) and then failed.

Expected behavior
I expect emissary to come up and function without segfaulting or repeatedly restarting.

Versions (please complete the following information):

  • emissary-ingress: 8.9.1, 8.12.2
  • Kubernetes environment: GKE Autopilot 1.32.4-gke.1106006, 1.32.3-gke.1927009

Logs

Here are some potentially useful log messages:

Last normal-looking log message:

time="2025-05-28 20:18:31.5586" level=info msg="Pushing snapshot v1" func=github.com/datawire/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex/updater

Segfault log message:

[2025-05-28 20:18:31.640][25][critical][backtrace] [./source/server/backtrace.h:127] Caught Segmentation fault, suspect faulting address 0x0
[2025-05-28 20:18:31.641][25][critical][backtrace] [./source/server/backtrace.h:111] Backtrace (use tools/stack_decode.py to get line numbers):
[2025-05-28 20:18:31.641][25][critical][backtrace] [./source/server/backtrace.h:112] Envoy version: 628f5afc75a894a08504fa0f416269ec50c07bf9/1.31.4-dev/Clean/RELEASE/BoringSSL
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #9: [0x5ab789dfe2a5]
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #10: [0x5ab78d41c9ce]
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #11: [0x5ab78b61d1a2]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #12: [0x5ab78b61d0df]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #13: [0x5ab78b61e168]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #14: [0x5ab78b61d31f]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #15: [0x5ab78b5cb968]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #16: [0x5ab78b5cce51]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #17: [0x5ab78b5ca74c]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #18: [0x5ab78b5cb06e]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #19: [0x5ab78b5cb1fc]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #20: [0x5ab789dc314c]
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #21: [0x784f2f0b5510]

More Segfault log entries:

time="2025-05-28 20:18:31.6466" level=info msg="finished with error: signal: segmentation fault" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:257" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=25
time="2025-05-28 20:18:31.6467" level=error msg="goroutine \"/envoy\" exited with error: signal: segmentation fault" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=/envoy
time="2025-05-28 20:18:32.0499" level=error msg="shut down with error error: signal: segmentation fault" func=github.com/datawire/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint PID=1

The final shutdown log entries (mostly info-level, plus the final error), possibly truncated for width:

[2025-05-28 20:18:31 +0000] [17] [INFO] Shutting down: Master
time="2025-05-28 20:18:32.0494" level=info msg="finished successfully: exit status 0" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:255" CMD=entrypoint PID=1 THREAD=/diagd dexec.pid=17
time="2025-05-28 20:18:32.0496" level=info msg=" final goroutine statuses:" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:84" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0497" level=info msg=" /ambex : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0497" level=info msg=" /diagd : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0497" level=info msg=" /envoy : exited with error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0498" level=info msg=" /external_snapshot_server: exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0498" level=info msg=" /healthchecks : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0498" level=info msg=" /memory : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0499" level=info msg=" /snapshot_server : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0499" level=info msg=" /watcher : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2025-05-28 20:18:32.0499" level=error msg="shut down with error error: signal: segmentation fault" func=github.com/datawire/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint 

Setup Details

CRDs:

  crd_url = "https://app.getambassador.io/yaml/emissary/3.9.1/emissary-crds.yaml"
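As a sanity check (my addition, not part of the original setup), the same CRD manifest can be applied by hand outside Terraform to rule out a Terraform-side problem and confirm the CRDs register:

```shell
# Apply the same manifest Terraform uses, then list the
# getambassador.io CRDs to confirm they registered.
kubectl apply -f https://app.getambassador.io/yaml/emissary/3.9.1/emissary-crds.yaml
kubectl get crd | grep getambassador.io
```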

Cluster Creation:

resource "google_container_cluster" "default" {
  name                = var.cluster_name
  description         = "Cluster: ${var.cluster_name}"
  location            = var.region
  deletion_protection = var.cluster_deletion_protection

  min_master_version = "1.32.4-gke.1106006"

  // label the resources for accounting
  resource_labels = {
    cluster-role = var.cluster_role
    cluster-name = var.cluster_name
  }

  network    = var.network
  subnetwork = google_compute_subnetwork.default.name

  # See https://github.com/hashicorp/terraform-provider-google/issues/10782
  ip_allocation_policy {
  }

  # See https://github.com/hashicorp/terraform-provider-google/issues/15454
  lifecycle {
    ignore_changes = [dns_config, gateway_api_config]
  }

  enable_autopilot = true
}
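Since Autopilot can auto-upgrade the control plane, it may be worth confirming which version the cluster is actually running and whether an upgrade landed recently. A diagnostic sketch (not part of my Terraform; CLUSTER and REGION are placeholders for var.cluster_name / var.region):

```shell
# Report the versions the cluster is actually on right now.
gcloud container clusters describe CLUSTER --region REGION \
  --format='value(currentMasterVersion,currentNodeVersion)'

# List recent upgrade operations, to see if GKE upgraded the
# cluster around the time the crashes started.
gcloud container operations list --region REGION \
  --filter='operationType=UPGRADE_MASTER OR operationType=UPGRADE_NODES'
```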

Emissary Install:

variable "chart_version" {
  description = "emissary chart version"
  # default = "v8.9.1"
  default = "v8.12.2"
}

resource "helm_release" "emissary" {
  name = "emissary-ingress"
  namespace = kubernetes_namespace.emissary.id
  repository = "https://app.getambassador.io"
  chart = "emissary-ingress"
  version = var.chart_version

  // this was testing to see if increasing memory limits helped -- it didn't
  set {
    name  = "resources.limits.memory"
    value = "4Gi"
  }
  set {
    name  = "resources.requests.memory"
    value = "4Gi"
  }
}
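When the pod is crash-looping like this, the logs from the previous container instance are where the segfault shows up; a sketch of the commands I use to capture them (the `emissary` namespace and the `app.kubernetes.io/name=emissary-ingress` label are assumptions based on the chart defaults above):

```shell
# Look for CrashLoopBackOff status and climbing restart counts.
kubectl -n emissary get pods

# Logs from the crashed (previous) container instance.
kubectl -n emissary logs deploy/emissary-ingress --previous

# Pod events; distinguishes OOMKilled from a plain segfault exit.
kubectl -n emissary describe pod -l app.kubernetes.io/name=emissary-ingress
```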

Labels: t:bug (Something isn't working)