Leveraging OpenTelemetry for Fault-Tolerant Prometheus Metrics with Envoy Mirroring
This article will help build a fault-tolerant and highly available solution to collect and forward metrics from applications and services running in Kubernetes to the remote Prometheus-compatible long-term TSDB storage. It also requires proper knowledge about the components used, such as the OpenTelemetry collector, Prometheus in agent mode, and Envoy proxy request mirroring. Detailed configuration is outside the scope of this article.
The OTEL collector collects metrics from desired resources, and the pipeline is configured using OpenTelemetry collector receivers, processors, and exporters to process and send collected metrics to the endpoint of the Envoy proxy.
The Envoy proxy is configured with a static route mirror policy with upstream clusters of Prometheus pods. This means that the Envoy proxy directly connects to the k8s pod and not to the k8s service in front of the pods. Each Prometheus pod represents an Envoy upstream cluster. Data are routed primarily to one of the two replicas of the Prometheus pod and mirrored to the second one.
Prometheus is deployed into the k8s cluster with two replicas in Agent mode with the remote-write-receiver feature enabled. Also, an external label prometheus_replica was added to instances, which is used to deduplicate series in Thanos, sent from high-availability Prometheus instances pairs.
This design helped make monitoring more resilient and reduced the time series data gap in Grafana dashboards.
Senior DevOps Engineer
Dedicated professional with experience in managing cloud infrastructure and system administration, integrating cloud-based infrastructure components, and developing automation and data engineering solutions. Good at troubleshooting problems and building successful solutions. Excellent verbal and written communicator with strong background cultivating positive relationships and exceeding goals.
The entire Grow2FIT consulting team: Our team