Leveraging OpenTelemetry for Fault-Tolerant Prometheus Metrics with Envoy Mirroring
Metrics collected from applications and services often need to be forwarded from the local environment to remote, centralized long-term storage such as Thanos or Mimir.
This article outlines a fault-tolerant, highly available way to collect metrics from applications and services running in Kubernetes and forward them to remote, Prometheus-compatible long-term TSDB storage. It assumes working knowledge of the components involved: the OpenTelemetry Collector, Prometheus in agent mode, and Envoy proxy request mirroring. Detailed configuration is outside the scope of this article.
How it works
The OpenTelemetry Collector collects metrics from the desired resources. Its pipeline of receivers, processors, and exporters processes the collected metrics and sends them to the Envoy proxy endpoint.
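As a rough illustration of such a pipeline, the fragment below is a minimal sketch of a Collector configuration. The article does not say which receiver, processor, or exporter is used, so the prometheus receiver, the batch processor, and the prometheusremotewrite exporter (from the Collector contrib distribution) are assumptions, as are the job name and the Envoy address.

```yaml
# Minimal sketch of an OpenTelemetry Collector metrics pipeline.
# All component choices, names, and addresses here are assumptions.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods          # hypothetical job name
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod                      # discover scrape targets from pods

processors:
  batch: {}                                  # group samples before export

exporters:
  prometheusremotewrite:
    # Hypothetical in-cluster address of the Envoy listener; Envoy forwards
    # the request to the Prometheus remote-write receiver path.
    endpoint: http://envoy.monitoring.svc.cluster.local:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

The only part that matters for this design is the exporter endpoint: the Collector writes to Envoy, not to either Prometheus replica directly.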
The Envoy proxy is configured with a static route that carries a request mirror policy, with the Prometheus pods as upstream clusters. Envoy therefore connects directly to each Kubernetes pod rather than to the Kubernetes Service in front of the pods, and each Prometheus pod is its own Envoy upstream cluster. Every request is routed to one of the two Prometheus replicas as the primary and mirrored to the other.
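The routing described above could look roughly like the following static Envoy configuration. This is a sketch under assumptions rather than the configuration actually deployed: the listener port, the cluster names, and the per-pod DNS names (which presume a StatefulSet behind a headless Service called prometheus-headless in a monitoring namespace) are all hypothetical.

```yaml
# Sketch: one listener, one route, two single-pod clusters.
# Names, ports, and hostnames are assumptions for illustration.
static_resources:
  listeners:
    - name: remote_write
      address:
        socket_address: { address: 0.0.0.0, port_value: 9090 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: remote_write
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: prometheus
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/api/v1/write" }
                          route:
                            cluster: prometheus_0        # primary replica
                            request_mirror_policies:
                              - cluster: prometheus_1    # mirrored replica
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: prometheus_0
      type: STRICT_DNS
      connect_timeout: 1s
      load_assignment:
        cluster_name: prometheus_0
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      # Per-pod DNS name, not the Service address (hypothetical).
                      address: prometheus-0.prometheus-headless.monitoring.svc.cluster.local
                      port_value: 9090
    - name: prometheus_1
      type: STRICT_DNS
      connect_timeout: 1s
      load_assignment:
        cluster_name: prometheus_1
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: prometheus-1.prometheus-headless.monitoring.svc.cluster.local
                      port_value: 9090
```

Envoy sends mirrored requests in a fire-and-forget manner: it does not wait for the mirror cluster's response, and that response is discarded. The client therefore only sees the result from the primary cluster.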
Prometheus is deployed into the Kubernetes cluster as two replicas in agent mode, with the remote-write-receiver feature enabled. Each instance also carries an external label, prometheus_replica, which Thanos uses to deduplicate series sent from highly available Prometheus pairs.
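A minimal sketch of the Prometheus side is shown below as two fragments: the container arguments and the configuration file of one replica. The label values, the file path, and the remote storage URL are placeholders, and the manifest shape is an assumption; only agent mode, the remote-write receiver, and the prometheus_replica label come from the setup described above.

```yaml
# Fragment 1: container args for each Prometheus replica (hypothetical manifest).
args:
  - --enable-feature=agent               # agent mode on Prometheus 2.x; newer releases use --agent
  - --web.enable-remote-write-receiver   # accept remote-write requests on /api/v1/write
  - --config.file=/etc/prometheus/prometheus.yml
---
# Fragment 2: prometheus.yml for replica 0.
# Replica 1 is assumed to differ only in the label value (for example prometheus-1).
global:
  external_labels:
    prometheus_replica: prometheus-0
remote_write:
  # Placeholder URL for the remote long-term storage endpoint.
  - url: http://thanos-receive.example.internal:19291/api/v1/receive
```

On the Thanos side, deduplication by this label is typically enabled on the querier with --query.replica-label=prometheus_replica, so that the two copies of each series are presented as one.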
The result
With this design, monitoring became more resilient, and the gaps in time-series data on Grafana dashboards were reduced.
Where We Apply This
Resilient, gap-free observability is part of how we run platforms for clients in regulated, high-stakes environments, where a missing window of metrics can mean a missed incident. Explore our infrastructure services or meet the team behind Grow2FIT.