
Enhance OpenTelemetry gRPC With a Consistent Hash Load Balancer

This article demonstrates how to leverage Envoy's consistent-hash load balancing for OpenTelemetry OTLP gRPC payloads.

The use case

The OpenTelemetry Collector (OTel collector) is deployed as an agent alongside the application on remote servers. It sends telemetry data (logs, traces, metrics) from the application and the host to central storage through a gateway deployed on the Kubernetes cluster.
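
On the agent side, the flow is plain OTLP over gRPC. Here is a minimal sketch of an agent configuration, assuming a hypothetical gateway address otel-gateway.example.com:4317 (the batch processor matters later, because batching is part of what makes random routing costly):

receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch: {}   # the agents send data in batches

exporters:
  otlp:
    # hypothetical gateway endpoint; replace with the real ingress hostname
    endpoint: otel-gateway.example.com:4317

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]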

The OTel collector is deployed using the OpenTelemetry Operator Helm chart, with a Kubernetes HPA scaling the replicas based on CPU load. The traffic is routed through a headless service, because the standard Kubernetes service is not a good fit for gRPC, as described in this article. But with this setup there is no load balancing on the Kubernetes side, which the same article also points out.
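
For reference, a minimal sketch of such a headless service, assuming the gateway pods carry the hypothetical label app: opentelemetry-collector and expose the default OTLP gRPC port:

apiVersion: v1
kind: Service
metadata:
  name: opentelemetry-collector
spec:
  clusterIP: None              # headless: DNS returns the individual pod IPs
  selector:
    app: opentelemetry-collector
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317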

This lack of load balancing, combined with the OTel agents sending data in batches, causes data from the same remote host to be forwarded randomly across the OTel collector gateway replicas. The same data is then written into storage multiplied by the number of replicas, because each copy carries a different value of the label holding the identity of the OTel replica. This drastically increases storage usage, and every query must aggregate the duplicates.

Let’s show it with an example.
Take one of the OTel agent metrics, otelcol_process_uptime, which carries a label added by the OTel gateway called otelcol_replica, holding the name of the replica. The OTel gateway has four replicas; let’s query the metric with PromQL on the storage side:

avg by (otelcol_replica)(otelcol_process_uptime{hostname="xxxxxx"})

{otelcol_replica="opentelemetry-collector-5fc9f8g5sj5"} 2502046.749352578
{otelcol_replica="opentelemetry-collector-5fc9f8pfmvh"} 2502096.74889717
{otelcol_replica="opentelemetry-collector-5fc9f8rzkh4"} 2502156.749325255
{otelcol_replica="opentelemetry-collector-5fc9f8xj95v"} 2502136.749453457

As demonstrated, the data coming from the remote host is written four times into the storage.
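
Until the routing is consistent, every query has to aggregate the replica copies away explicitly, for example (max is one reasonable choice for an uptime gauge):

max without (otelcol_replica)(otelcol_process_uptime{hostname="xxxxxx"})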

The solution to this problem is a load-balancing mechanism that consistently routes data from the same remote source through the same OTel collector replica. That is where the Envoy proxy is a perfect candidate, offering load balancers based on consistent hashing.

The solution

The Envoy proxy is deployed with two replicas behind a headless service, placed between the ingress and the OTel collector gateway.

It is configured with a ring-hash load balancer keyed on the X-Forwarded-For HTTP header, with HTTP/2 enabled for the upstream cluster.

...
route:
  cluster: "opentelemetry-collector-cluster"
  hash_policy:
    # hash on the original client IP carried in x-forwarded-for,
    # so all requests from one remote host map to the same upstream
    - header:
        header_name: x-forwarded-for
...
clusters:
- name: opentelemetry-collector-cluster
  connect_timeout: 0.25s
  # STRICT_DNS resolves every A record behind the headless service,
  # i.e. the individual gateway pod IPs
  type: STRICT_DNS
  dns_lookup_family: V4_ONLY
  lb_policy: RING_HASH          # consistent hashing across the resolved endpoints
  http2_protocol_options: {}    # OTLP gRPC requires HTTP/2 upstream
...
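
The elided parts include, among other things, the cluster's endpoints. A minimal sketch of the load_assignment, assuming the gateway's headless service resolves at the hypothetical DNS name opentelemetry-collector.observability.svc.cluster.local and the default OTLP gRPC port:

load_assignment:
  cluster_name: opentelemetry-collector-cluster
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            # headless service name; STRICT_DNS expands it to the pod IPs
            address: opentelemetry-collector.observability.svc.cluster.local
            port_value: 4317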

This configuration ensures that data from the same source IP flows through the same OTel gateway replica for as long as that replica exists. With this consistent route, only one copy of the data from a remote host is written into storage.
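
Re-running the earlier PromQL query should now return a single series for the host instead of four:

avg by (otelcol_replica)(otelcol_process_uptime{hostname="xxxxxx"})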

If a replica fails, the Envoy proxy redirects the data flow to the next member of the hash ring, so for a short period two copies of the data will exist in storage, due to the changed value of the label holding the identity of the OTel collector replica.

Conclusion

Consider a high-load environment where the number of OTel gateway replicas could be scaled quite high. How much storage capacity could be saved with a reliable, consistent data flow from remote sources?

Author

Gabriel Illés
Senior DevOps Engineer

Dedicated professional with experience in managing cloud infrastructure and system administration, integrating cloud-based infrastructure components, and developing automation and data engineering solutions. Skilled at troubleshooting problems and building successful solutions. Excellent verbal and written communicator with a strong background in cultivating positive relationships and exceeding goals.
