Description
I have been running emissary-ingress for a few years. Something changed recently, in either emissary-ingress or GKE, that has broken Emissary. It no longer comes up; instead it gets stuck in a loop where it fails and then retries.
My entire setup is Terraform, so it is extremely repeatable.
The behavior started a few days ago, some time after 2025-05-20. GKE may have upgraded the Autopilot cluster's Kubernetes version underneath me. The Terraform below upgrades the cluster to the latest GKE Kubernetes version. The problem is present on at least this version and the prior one.
v8.9.1 had similar problems. It additionally hit an exception where it attempted to allocate 103TB of RAM (or some similarly outrageous amount) and then failed.
Expected behavior
I expect emissary to come up and function without segfaulting or repeatedly restarting.
Versions (please complete the following information):
- emissary-ingress (Helm chart version): [8.9.1, 8.12.2]
- Kubernetes environment [GKE Autopilot 1.32.4-gke.1106006, 1.32.3-gke.1927009]
Logs
Here are some potentially useful log messages:
Last normal-looking log message:
time="2025-05-28 20:18:31.5586" level=info msg="Pushing snapshot v1" func=github.com/datawire/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex/updater
Segfault log message:
[2025-05-28 20:18:31.640][25][critical][backtrace] [./source/server/backtrace.h:127] Caught Segmentation fault, suspect faulting address 0x0
2025-05-28 13:18:31.641 PDT
[2025-05-28 20:18:31.641][25][critical][backtrace] [./source/server/backtrace.h:111] Backtrace (use tools/stack_decode.py to get line numbers):
2025-05-28 13:18:31.641 PDT
[2025-05-28 20:18:31.641][25][critical][backtrace] [./source/server/backtrace.h:112] Envoy version: 628f5afc75a894a08504fa0f416269ec50c07bf9/1.31.4-dev/Clean/RELEASE/BoringSSL
2025-05-28 13:18:31.642 PDT
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #9: [0x5ab789dfe2a5]
2025-05-28 13:18:31.642 PDT
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #10: [0x5ab78d41c9ce]
2025-05-28 13:18:31.642 PDT
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #11: [0x5ab78b61d1a2]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #12: [0x5ab78b61d0df]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #13: [0x5ab78b61e168]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #14: [0x5ab78b61d31f]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #15: [0x5ab78b5cb968]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #16: [0x5ab78b5cce51]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #17: [0x5ab78b5ca74c]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #18: [0x5ab78b5cb06e]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #19: [0x5ab78b5cb1fc]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #20: [0x5ab789dc314c]
2025-05-28 13:18:31.643 PDT
[2025-05-28 20:18:31.643][25][critical][backtrace] [./source/server/backtrace.h:121] #21: [0x784f2f0b5510]
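The frames above are raw addresses mixed with console timestamp lines. A minimal sketch for pulling the frame numbers and addresses out of a paste like this, so they can be handed to whatever symbolizer is available, could look like the following. The function name `extract_frames` and the regular expression are illustrative assumptions, not part of Envoy or Emissary tooling.

```python
import re

# Matches Envoy backtrace frames such as "... #9: [0x5ab789dfe2a5]".
# The pattern is an assumption based only on the log lines pasted above.
FRAME_RE = re.compile(r"#(\d+): \[(0x[0-9a-fA-F]+)\]")


def extract_frames(log_text):
    """Return (frame_number, address) pairs in the order they appear."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), int(match.group(2), 16)))
    return frames


sample = """\
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #9: [0x5ab789dfe2a5]
2025-05-28 13:18:31.642 PDT
[2025-05-28 20:18:31.642][25][critical][backtrace] [./source/server/backtrace.h:121] #10: [0x5ab78d41c9ce]
"""

for number, address in extract_frames(sample):
    print(f"#{number}: {address:#x}")
```

Lines without a frame marker, such as the interleaved console timestamps, are skipped.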
More Segfault log entries:
time="2025-05-28 20:18:31.6466" level=info msg="finished with error: signal: segmentation fault" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:257" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=25
time="2025-05-28 20:18:31.6467" level=error msg="goroutine \"/envoy\" exited with error: signal: segmentation fault" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=/envoy
time="2025-05-28 20:18:32.0499" level=error msg="shut down with error error: signal: segmentation fault" func=github.com/datawire/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint PID=1
All of the error-level logs, possibly truncated for width:
2025-05-28 13:18:31.949 PDT
[2025-05-28 20:18:31 +0000] [17] [INFO] Shutting down: Master
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0494" level=info msg="finished successfully: exit status 0" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:255" CMD=entrypoint PID=1 THREAD=/diagd dexec.pid=17
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0496" level=info msg=" final goroutine statuses:" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:84" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0497" level=info msg=" /ambex : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0497" level=info msg=" /diagd : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0497" level=info msg=" /envoy : exited with error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0498" level=info msg=" /external_snapshot_server: exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0498" level=info msg=" /healthchecks : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0498" level=info msg=" /memory : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0499" level=info msg=" /snapshot_server : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0499" level=info msg=" /watcher : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
2025-05-28 13:18:32.050 PDT
time="2025-05-28 20:18:32.0499" level=error msg="shut down with error error: signal: segmentation fault" func=github.com/datawire/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint
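Most of the entries above carry `level=info` in their own payload, even though they appear in this list. A rough sketch for filtering a paste like this by the `level=` field inside each entry is shown below. The helper name `entries_at_level` is hypothetical, and the parsing assumes the `key=value` layout seen in these lines rather than any documented Emissary log format.

```python
import re

# Assumed layout: time="..." level=<word> msg="..." key=value ...
LEVEL_RE = re.compile(r"\blevel=(\w+)")
MSG_RE = re.compile(r'\bmsg="((?:[^"\\]|\\.)*)"')


def entries_at_level(log_text, level):
    """Return the msg field of every entry whose level= field equals `level`."""
    messages = []
    for line in log_text.splitlines():
        level_match = LEVEL_RE.search(line)
        if level_match and level_match.group(1) == level:
            msg_match = MSG_RE.search(line)
            # Fall back to the whole line if no msg="..." field is present.
            messages.append(msg_match.group(1) if msg_match else line)
    return messages


sample = r'''
2025-05-28 13:18:32.049 PDT
time="2025-05-28 20:18:32.0494" level=info msg="finished successfully: exit status 0" CMD=entrypoint PID=1 THREAD=/diagd dexec.pid=17
2025-05-28 13:18:32.050 PDT
time="2025-05-28 20:18:32.0499" level=error msg="shut down with error error: signal: segmentation fault" CMD=entrypoint
'''

for message in entries_at_level(sample, "error"):
    print(message)
```

The sample entries are shortened copies of the lines above, with the `func=` and `file=` fields dropped for width.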
Setup Details
CRDs:
crd_url = "https://app.getambassador.io/yaml/emissary/3.9.1/emissary-crds.yaml"
Cluster Creation:
resource "google_container_cluster" "default" {
  name = var.cluster_name
  description = "Cluster: ${var.cluster_name}"
  location = var.region
  deletion_protection = var.cluster_deletion_protection
  min_master_version = "1.32.4-gke.1106006"

  // label the resources for accounting
  resource_labels = {
    cluster-role = var.cluster_role
    cluster-name = var.cluster_name
  }

  network = var.network
  subnetwork = google_compute_subnetwork.default.name

  # See https://github.com/hashicorp/terraform-provider-google/issues/10782
  ip_allocation_policy {
  }

  # See https://github.com/hashicorp/terraform-provider-google/issues/15454
  lifecycle {
    ignore_changes = [dns_config, gateway_api_config]
  }

  enable_autopilot = true
}
Emissary Install:
variable "chart_version" {
  description = "emissary chart version"
  # default = "v8.9.1"
  default = "v8.12.2"
}

resource "helm_release" "emissary" {
  name = "emissary-ingress"
  namespace = kubernetes_namespace.emissary.id
  repository = "https://app.getambassador.io"
  chart = "emissary-ingress"
  version = var.chart_version

  // this was testing to see if increasing memory limits helped -- it didn't
  set {
    name = "resources.limits.memory"
    value = "4Gi"
  }
  set {
    name = "resources.requests.memory"
    value = "4Gi"
  }
}