How Achievers load tests Kubernetes

After Achievers built a load testing framework to help our engineering teams understand how our platform performs under unexpected load, the team was able to improve overall platform performance, allow for 4x more traffic throughput, and scale clusters more effectively. This blog reviews goals set from the initial baseline results, and how we resolved bottlenecks we encountered during the testing phase. Now that we had a good report of the current state of the platform, it was time to beat that baseline.

Beating the baseline

Let’s do a quick review of the metrics we monitored and reported on.

Throughput: Refers to the number of requests that a system can handle per second.
Errors: Number of failed requests that occurred during the test period.
Latency: The time taken for a request to be processed from the time it was sent out.
Scalability: The platform's ability to handle increasing loads without significant performance degradation.

Having highlighted the metrics above, we then set a goal for each. This was done to effectively measure if we can successfully improve our system.

Outcome goals to measure load testing success

To measure testing results and track expectations, we created observability dashboards built in New Relic. Dashboards prepared the team to quickly identify bottlenecks and issues that arise during the testing period.

After reviewing the dashboards it was apparent bottlenecks were limiting our platform performance.

Throughput optimization

Istio Load balancing

gRPC keeps TCP sessions open as long as possible to maximize throughput and minimize overhead, but the long-lived sessions make load balancing complex. This is particularly an issue in auto-scaled Kubernetes environments. When load increases new pods are added, however, the client will stay connected to the old gRPC pods, resulting in unequal load distribution.

Below is a great example of gRPC traffic travelling to a deployment and not balancing.

New Relic dashboard showing unequal balancing of requests per second (RPS)

You can see we are getting sporadic requests to each pod, including one pod getting no traffic at all. Istio helps us distribute requests by sharing connection information between Envoy proxies. The Envoy proxies share the number of requests being received, Istio can determine which pod is less busy and can better serve the request.

Configuring load balancing on your destinationRule to use LEAST_REQUEST instead of the default of round-robin, helped solve the issue of unbalanced traffic.

DestinationRule with LEAST_REQUEST set

After the new loadBalancer configuration was deployed, the results show an even distribution of requests among pods and a large bump in throughput across all services.

Results of an evenly balanced distribution of requests per pod

Errors

Client-side

Not all bottlenecks are on the server. We noticed a few bottlenecks on the load testing client generating the load resulting in a high error rate. While we had enough CPU/memory to run our tests, our network had some limitations which revealed themselves.

Google Cloud virtual machines without an external IP address have traffic travelling outbounds through a Cloud NAT. When running a load test from inside another GCP project, egress traffic spiked OUT_OF_RESOURCES errors during the testing period.

High error rate on outbound connections during testing

The team added extra IP addresses and bumped the “ports allowed to a destination address” to the Google Cloud NAT. You can see from the image below we hit a flat line limit before solving the issue at the end of the graph.

Flat lines show resource exhaustion — The end of the image shows the issue resolved

A long-term solution would be to fully move to IPV6 to remove outbound connection limits. Achievers plans to be on IPV6 for all our networking in 2024.

Latency

Istio config

Since all cluster traffic travels through the Istio Envoy proxy, we thought there would be room to improve latency. During the investigation we did not see any long transactions hanging in Envoy, however, we thought this would be a good time to see if we could improve Istio by tweaking some configuration. Istio Concurrency is the number of worker threads which run on the Envoy proxy sidecar. If unset, the default is 2. If set to 0, this will be configured to use all cores on the machine using CPU requests and limits to choose a value, with limits taking precedence over requests.

Concurrency Results:

4 workers => 4434.11/s + p(95)=114.67ms

6 workers => 4230.42/s + p(95)=98.3ms

2 workers => 4519.10/s + p(95)=100.75ms

In our case, Istio concurrency didn’t seem to impact performance at all. It was worth the exercise to see if it would benefit us, although in the end there were no significant gains.

Code improvements

Often shared code can introduce issues across services. One example of this is when a bug in our database connector caused quite a bit of extra latency in outbound connections.

Extra database calls caused unnecessary latency — Red shows a breach in SLO (measured in milliseconds)

Using New Relic tracing we were able to easily spot a pattern. In the image above you can see latency results were high on endpoints which had these slow database calls. Database locks were used to switch to the schema needed before executing the query. This costs additional time to maintain those locks. The issue was resolved by removing a call to switch schemas and instead appending the schema name to the query.

After the library was optimized, the service saw an overall improvement in the average response time.

After database calls were enhanced, the average response time dropped significantly(measured in milliseconds)

Another code issue was revealed on an old endpoint that lived in our monolith. The endpoint was seeing very poor performance, often causing requests to take longer than 10 seconds to complete.

One slow endpoint was taking 10 seconds to complete

After we upgraded the endpoint and optimized database queries, we managed to get that back down to milliseconds.

After enhancements to the code to reduce latency

Caching

Kubernetes pod out-of-memory

Large caching improvements came by moving some images to the CDN to improve response times. A few bottlenecks were located on services which rely heavily on Redis cache. While Redis should not be a dependency for services, it does greatly improve performance for applications.

Reviewing resource graphs on New Relic revealed Redis was being exhausted of resources in a couple of critical namespaces. The two purple lines exceed the Kubernetes limit, pointing to memory exhaustion and out-of-memory (OOM) errors.

New Relic Dashboard showing high memory usage on Redis masters during load testing

A quick PR to bump Redis Kubernetes requests quickly solved memory exhaustion. Overall, this was a great exercise to test how much of a hit on performance heavily cached services had when Redis resources became saturated.

Scaling

Nodes are the managed virtual machines a Kubernetes cluster will run workloads on. Even though the virtual machines may be ephemeral and come and go, you need to guarantee your cluster is performing well and scaling accordingly. Once you go through basic improvements such as tweaking node CPU, memory, and disk, you should review some of the other modifications made below which ensure your infrastructure is scaling effectively.

IP Address limitations:

Clusters are created with IP ranges, and often values are never revisited. At the beginning of the testing, we quickly noticed our cluster could soon hit a limit with the number of worker nodes in the cluster. Kubernetes will take the pod network and divide it across the worker node network, restricting the number of workers in your nodepools. Depending on the max_pods_per_node value, node creation is bound by the number of addresses in the Pod address range.

Example: Scaling flatlined during a load test on a Kubernetes cluster

For example, if you set the default maximum number of Pods to 110 and the secondary IP address range for Pods to /21, Kubernetes assigns a /24 CIDR range to nodes on the cluster. This allows a maximum of 2(24-21) = 23 = 8 nodes on the cluster.

The limit of IP addresses may result in your cluster’s nodes being unable to scale and workloads being throttled, or worse, crashing due to resource exhaustion. This was an easy fix for us as we just needed to adjust some of the IP ranges assigned to pods and workers. Google has also introduced “secondary IP ranges” which can assist teams who need to rebuild their clusters to address this issue.

TopologySpreadConstraints:

High-utilized Kubernetes Deployment may need to be distributed amongst different worker nodes to ensure resources are not exhausted. Using topologySpreadConstraints allows Pods to spread across the clusters’ failure domains such as regions, zones, hosts, and other user-defined topologies.

We often use this feature on our critical services to ensure our Kubernetes clusters scale effectively. When Unsatisfiable is set to DoNotSchedule it can force the cluster to scale if more than two pods live on all workers.

If we allow the cluster to scale with critical services, we ensure we have enough CPU and memory during a spike in traffic. Tweaking this value along with average scaling values, allowed for the cluster to scale 10x more effectively.

Conclusion

After we resolved all the larger bottlenecks, it was time to review our initial goals. The results were in!

The team was able to meet our goals in every case, often surpassing expectations. It was relieving to see we did not impact any of our defined SLOs for errors or latency by the end of our initial testing. By improving throughput by 4x we were able to confidently handle failover traffic in a single region during an outage. Overall the cluster scaling had also improved by introducing properly sized resources, along with Istio handling gRPC request load balancing.

While we will always have room for improvement, the Achievers platform now has a flexible framework to continuously load test and measure performance.