Serverless Exodus to GKE Autopilot

Author: Jingkai He | Posted on: September 13, 2024


Over the last year CECG has been working on an engagement within a client’s Advertising Technology division to deliver an ad-decision server solution. The service comes with the following requirements:

  • Make ad-decisions based on upstream requests.
  • Forward requests to external ad decision-makers under certain criteria.
  • Support up to 500K RPS (requests per second).
  • P95 latency must be under 300ms.
  • Error budget: a >= 99.9% success rate (non-5xx responses) over a 5-minute rolling window (see the arithmetic just after this list).
  • Handle thundering traffic: the service must be able to serve a disproportionately huge amount of traffic within a few seconds (imagine a Game of Thrones ad break).
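
To put that target in concrete terms (a back-of-the-envelope figure, assuming peak traffic is sustained for a full window):

    500,000 RPS × 300 s = 150,000,000 requests per 5-minute window
    (1 − 0.999) × 150,000,000 = 150,000 5xx responses allowed per window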

Initial Implementation: Cloud Run

The initial solution was developed on a serverless offering, namely GCP Cloud Run. In theory, it would allow us to:

  • Own fewer pieces of infrastructure as a team of software engineers.
  • Prototype the service quickly, given the lower infrastructure overhead.
  • Blitz-scale the service up and down horizontally based on traffic.
  • Pay <= $20k per month under the pay-per-request model - not cost-optimised compared with a VM/server-based solution, but still a massive saving from the business point of view.
  • Satisfy, on paper, the business requirements listed in the introduction.

A few months into development, the team discovered that Cloud Run could not fulfil the requirements, notably:

  • 500K RPS simply cannot be satisfied. At maximum, only 150k RPS could be supported when Cloud Run was hit directly, and the figure was even worse when traffic was routed through the API Gateway.
  • A 5% average error rate, made up of 503 (service unavailable), 504 (gateway timeout) and 429 (too many requests, indicating that Cloud Run’s rate limiting had kicked in) errors. Google later acknowledged that this was their DDoS protection kicking in.
  • High latency: under high load the server suffered 500ms+ P95 latency, which cannot satisfy the business requirement.
  • The service did not auto-scale elastically and often scaled out late.
  • A low concurrency rate had to be set on each Cloud Run execution runtime (as recommended by GCP), with a maximum of only 10 requests per instance on a 1 vCPU & 2G RAM baseline (sketched just below).
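
For illustration, that configuration can be expressed roughly as follows. The service name, region and instance bounds are hypothetical; only the 1 vCPU / 2G RAM / 10-request concurrency figures come from the setup described above:

    # A rough sketch, not the exact production configuration.
    gcloud run deploy ad-decision \
      --image="${IMAGE}" \
      --cpu=1 \
      --memory=2Gi \
      --concurrency=10 \
      --min-instances=50 \
      --max-instances=1000 \
      --region=europe-west2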


Take Many Steps Back: VM test

During local application-code profiling we noticed that the ad-decision proxy server could easily achieve 150k+ RPS on a local Linux VM workstation running on VMware Fusion. During the local test, traffic flowed from the wrk CLI through the local loopback interface without hitting the real network stack, so it wasn’t particularly realistic; however, the performance was convincing enough to make us believe that relatively good performance could be achieved by hosting the ad-decision server on a GCE Managed Instance Group (MIG). As a result, we tried out the following setup (a rough sketch follows the list):

  • Deploy the ad-decision server on GCE virtual machines (VMs).
  • Each VM, on startup, launches an ad-decision server container on the host network, following https://cloud.google.com/compute/docs/containers/deploying-containers
  • VMs are managed by a MIG, auto-scaling based on CPU usage.
  • VMs are fronted by an application load balancer (ALB).
  • The ALB also performs TLS termination and instance-group health checking.
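
Sketched with gcloud below; the machine type, sizes and scaling thresholds are illustrative rather than the exact values we used, and the host-network container settings plus the ALB/health-check wiring are omitted for brevity. The wrk invocation mirrors the local benchmark mentioned above:

    # Local benchmark over the loopback interface (endpoint path is illustrative).
    wrk -t8 -c512 -d60s --latency http://127.0.0.1:8080/decision

    # Instance template that launches the ad-decision container on each VM.
    gcloud compute instance-templates create-with-container ad-decision-template \
      --machine-type=e2-standard-4 \
      --container-image="${IMAGE}"

    # Regional managed instance group built from the template, auto-scaling on CPU.
    gcloud compute instance-groups managed create ad-decision-mig \
      --region=europe-west2 \
      --template=ad-decision-template \
      --size=10

    gcloud compute instance-groups managed set-autoscaling ad-decision-mig \
      --region=europe-west2 \
      --min-num-replicas=10 \
      --max-num-replicas=200 \
      --target-cpu-utilization=0.6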

We quickly tested the setup. As it turned out, it handled 500K RPS with ease and auto-scaled reasonably quickly. That being said, this setup isn’t without its downsides, namely:

  • Management overhead:
    • Endpoint security software has to be installed on each VM to comply with the organisational security policy.
    • Extra complexity in dealing with the VM patching lifecycle.
    • Building a promotion pipeline for a VM-based fleet running a container-based workload was non-trivial (well, now we have https://kamal-deploy.org/ !).
    • The existing SRE team doesn’t have VM management expertise.
  • The setup comes with a high cost of ownership, which doesn’t quite align with the department’s “owning less infrastructure” strategy.
  • Admittedly, running a fleet of VMs in 2022 felt out of fashion and hard to standardise, leaving no chance of wider adoption within the department.

The “Middle ground”: GKE Autopilot

During the initial assessment stage we also looked into running the ad-decision server on Google Kubernetes Engine (GKE), considering:

  • It’s highly optimised by Google to run container-based workloads without much need for us to fine-tune kernel params. As a result, we’d expect it to be as performant as the VM-based solution, if not better.
  • At the same time it comes with rock-solid deployment primitives for reliable web-service rollouts, which also makes it easy to standardise, thus easier to adopt organisation-wide for a bigger impact.

However, the team was wary of introducing it, mostly because we didn’t want to be hit by accidental complexity: for a product team, the cost of owning a standard GKE cluster massively outweighs the benefits it brings.

Some time later GKE Autopilot drew our attention. It’s a GKE variant that provisions and manages the cluster’s underlying infrastructure, including nodes and node pools, giving end users a highly opinionated, optimised cluster and a hands-off experience. It appealed to us because of the following (a minimal setup sketch follows the list):

  • Hands-off experience: as a product team we don’t need to manage the VMs, VM pools and other underlying GCE infrastructure.
  • Close-to-VM performance that satisfies the business requirements. On Autopilot, due to restrictions, we can no longer run containers on the host network, but the networking stack is highly advanced and performant.
  • Seamless integration with other GCP services (e.g. the Stackdriver monitoring suite) via Workload Identity.
  • Pay-as-you-go billing model: we are billed only for the compute usage of the pods plus a light-touch control-plane management fee (which can be driven down further via EDP and committed-use discounts). This is much cheaper than Cloud Run’s pay-per-request billing model, which makes sense at low traffic volumes but whose benefit diminishes drastically on a high-traffic profile.
  • Building services on top of Kubernetes allows us to leverage the Kubernetes API as the “common protocol”, making the solution easier to lift and shift onto other platforms.
  • Last but not least it aligns with the “owning less infrastructure” strategy from the department.
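
To give a flavour of how little there is to own, here is a minimal sketch of the Autopilot setup. The cluster name, region, image, resource requests and scaling bounds are all hypothetical rather than our production values:

    # Autopilot provisions and scales the nodes; we only declare the workload.
    gcloud container clusters create-auto ad-decision-cluster --region=europe-west2
    gcloud container clusters get-credentials ad-decision-cluster --region=europe-west2

    # Deploy the ad-decision server and set the per-pod resource requests that
    # Autopilot provisions (and bills) against.
    kubectl create deployment ad-decision --image="${IMAGE}" --replicas=10
    kubectl set resources deployment ad-decision --requests=cpu=1,memory=2Gi
    kubectl expose deployment ad-decision --port=80 --target-port=8080

    # Horizontal scaling on CPU, analogous to the MIG autoscaler above.
    kubectl autoscale deployment ad-decision --cpu-percent=60 --min=10 --max=500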

GKE Autopilot was eventually chosen as the runtime for the ad-decision server going forward; surprisingly, migrating over from the existing Cloud Run-based solution took only a day.



Takeaways

Serverless solutions such as Cloud Run do scale for the large majority of use cases (don’t get me wrong!), but they are not a silver bullet. In our circumstances, where the service needs to serve disproportionate, “unnatural” traffic, it makes more sense to go for the “pay for the computing resources you rent” model (as with GKE Autopilot) than the “pay per request” model, from an economic, performance and reliability point of view.

By adopting GKE Autopilot, we are now paying only 7.5% of the Cloud Run cost that we initially anticipated for the production environment, and we will save even more once the committed-use discount has been applied.

More importantly, adopting GKE Autopilot delivers the performance and reliability the business requires for the service, along with much improved scalability. Despite being a hands-off solution, it provides slick Stackdriver integration for observability and gives us the tools to fine-tune the performance of the ad-decision servers. During the e2e load test it outperformed the downstream ad-decision maker while using only ⅓ of its computing resources - but that is probably a story for another day.