Technical Article 5 February 2024 Gabriel Illés, Senior DevOps Engineer

Leveraging OpenTelemetry for Fault-Tolerant Prometheus Metrics with Envoy Mirroring

Pipeline diagram: the OpenTelemetry Collector forwards metrics to an Envoy proxy, which mirrors them across two Prometheus replicas before remote-write to long-term storage

Metrics collected from applications or services often need to be forwarded from the local environment to remote, centralized long-term storage such as Thanos or Mimir.

This article outlines a fault-tolerant, highly available way to collect metrics from applications and services running in Kubernetes and forward them to remote, Prometheus-compatible long-term TSDB storage. It assumes working knowledge of the components involved: the OpenTelemetry Collector, Prometheus in agent mode, and Envoy proxy request mirroring. Detailed configuration is outside the scope of this article.

How it works

The OpenTelemetry Collector collects metrics from the desired resources. Its pipeline, built from receivers, processors, and exporters, processes the collected metrics and sends them to the Envoy proxy endpoint.
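As a rough illustration of such a pipeline, the following Collector configuration is a minimal sketch, not the author's actual setup. It assumes the contrib distribution of the Collector, a `prometheus` receiver for scraping, and the `prometheusremotewrite` exporter for delivery. The job name, scrape target, and Envoy address are hypothetical placeholders.

```yaml
# Minimal OpenTelemetry Collector sketch (assumed setup, contrib distribution).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: example-app                # hypothetical job name
          scrape_interval: 30s
          static_configs:
            - targets: ["example-app.default.svc:8080"]   # hypothetical target

processors:
  batch: {}                                    # group samples before export

exporters:
  prometheusremotewrite:
    # Hypothetical in-cluster address of the Envoy listener; the path is the
    # standard Prometheus remote-write receiver path.
    endpoint: http://envoy-metrics.monitoring.svc:10000/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

The Collector only knows about a single endpoint here. Fan-out to the two Prometheus replicas is left entirely to Envoy.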

The Envoy proxy is configured with a static route that carries a request mirror policy, and each Prometheus pod is defined as its own Envoy upstream cluster. Envoy therefore connects directly to the Kubernetes (k8s) pods rather than to the k8s Service in front of them. Each write request is routed to one of the two Prometheus replicas as the primary and mirrored to the other.
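The routing described above could look roughly like the following static Envoy (v3 API) configuration. This is a sketch under assumptions: the listener port, the cluster names, and the per-pod DNS names (which presume a StatefulSet behind a headless Service) are all hypothetical, since the article does not say how the pods are addressed.

```yaml
# Envoy static configuration sketch: one primary cluster, one mirror cluster.
static_resources:
  listeners:
    - name: remote_write
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }   # hypothetical port
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: remote_write
                route_config:
                  virtual_hosts:
                    - name: prometheus
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route:
                            cluster: prometheus_0          # primary replica
                            request_mirror_policies:
                              - cluster: prometheus_1      # mirrored replica
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: prometheus_0
      type: STRICT_DNS
      connect_timeout: 1s
      load_assignment:
        cluster_name: prometheus_0
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      # Hypothetical per-pod DNS name (headless Service assumed)
                      address: prometheus-0.prometheus-headless.monitoring.svc.cluster.local
                      port_value: 9090
    - name: prometheus_1
      type: STRICT_DNS
      connect_timeout: 1s
      load_assignment:
        cluster_name: prometheus_1
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: prometheus-1.prometheus-headless.monitoring.svc.cluster.local
                      port_value: 9090
```

Envoy treats mirrored requests as fire-and-forget: it does not wait for the mirror cluster's response before answering the client, so only the primary's status is returned to the Collector.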

Prometheus is deployed into the k8s cluster as two replicas in agent mode with the remote-write receiver enabled. Each instance also carries an external label, prometheus_replica, which Thanos uses to deduplicate series sent from high-availability Prometheus pairs.
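A minimal sketch of what one replica's setup might look like follows, assuming Prometheus 2.x (where agent mode is switched on with `--enable-feature=agent`) and a Thanos Receive endpoint as the remote-write target. The image tag, file paths, and remote URL are illustrative assumptions, and how the label value is templated per replica is left out.

```yaml
# Fragment of a pod spec for one Prometheus replica (assumed layout).
containers:
  - name: prometheus
    image: prom/prometheus:v2.49.1             # version is an assumption
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --enable-feature=agent                 # agent mode (Prometheus 2.x flag)
      - --web.enable-remote-write-receiver     # accept writes on /api/v1/write
      - --storage.agent.path=/prometheus       # WAL location in agent mode
---
# prometheus.yml for the first replica; the second replica uses a different
# prometheus_replica value so the pair can be told apart downstream.
global:
  external_labels:
    prometheus_replica: prometheus-0
remote_write:
  - url: https://thanos-receive.example.com/api/v1/receive   # hypothetical URL
```

On the Thanos side, deduplication depends on the querier being told which label identifies a replica, for example by running Thanos Query with `--query.replica-label=prometheus_replica`.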

The result

This design made monitoring more resilient and reduced gaps in the time-series data shown in Grafana dashboards.

Grafana dashboard showing continuous, gap-free time-series metrics from the fault-tolerant pipeline

Where We Apply This

Resilient, gap-free observability is part of how we run platforms for clients in regulated, high-stakes environments — where a missing window of metrics can mean a missed incident. Explore our infrastructure services or meet the team behind Grow2FIT.
