This article is a guide to serving large language models across multiple nodes on Kubernetes.
Challenges
Serving a Large Language Model (LLM) is a challenging task, for several reasons:
- It demands a lot of vRAM. A state-of-the-art open-weight model such as Llama 3.1 70B Instruct, with parameters stored in 16-bit floating point, requires around 140GB of vRAM just to load the weights into memory. Apart from the model itself, each serving unit also needs to accommodate the KV cache for incoming requests. Given it also needs to support a 128k token context length, the KV cache can easily demand 35GB more vRAM (depending on whether you can put up with CPU RAM swapping or a smaller max model length), and on top of that comes runtime overhead. This leads to roughly 140 + 35 + 40 = 215GB of vRAM usage (see the rough KV-cache estimate after this list).
- With such a large vRAM footprint, fitting everything onto a single GPU is impossible: it needs at least three NVLink-connected H100 GPUs (80GB vRAM each), and depending on how many GPUs your nodes carry it may not fit on a single node either. This leads to the requirement of gang-scheduled multi-node inference.
- To accommodate more concurrent users we ideally also want to scale the workload horizontally. However, in the context of Kubernetes, the existing deployment patterns such as Deployments and StatefulSets do not appear to be fit for purpose.
- Loading a large language model onto the Kubernetes nodes is also non-trivial. For example, a Llama 3.1 70B model alone uses 140GB of disk space, and the state-of-the-art vLLM container image weighs in at around 5GB. If we naively try to load the model onto k8s nodes from a container registry or Hugging Face, it is going to take around an hour or more. The slowness not only makes start-up slow, but also wastes a lot of compute (imagine 3 H100 GPUs sitting idle for an hour just to load the model).
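As a sanity check on the KV cache figure, here is a rough back-of-the-envelope estimate. It assumes Llama 3.1 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and 16-bit KV cache entries; treat it as an approximation rather than an exact budget.

```latex
\begin{align*}
\text{KV bytes per token} &= \underbrace{2}_{K,\,V} \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \underbrace{2\,\text{B}}_{\text{fp16}} \\
                          &= 2 \times 80 \times 8 \times 128 \times 2\,\text{B} \approx 320\,\text{KiB} \\
\text{KV cache, one 128k sequence} &\approx 320\,\text{KiB} \times 131072 \approx 40\,\text{GiB}
\end{align*}
```

So a single full-length sequence already eats tens of GB of vRAM, which is where the 35GB-plus figure above comes from; serving several long sequences concurrently pushes it even further.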
Introducing LWS
LWS is a set of new API primitives and controllers that sits on top of the Kubernetes API server. It extends the Kubernetes API to support the multi-node inference use case. It is currently a project governed by a Kubernetes Special Interest Group (SIG). According to the documentation:
LeaderWorkerSet: An API for deploying a group of pods as a unit of replication. It aims to address common deployment patterns of AI/ML inference workloads, especially multi-host inference workloads where the LLM will be sharded and run across multiple devices on multiple nodes. The initial design and proposal can be found at: [http://bit.ly/k8s-LWS](http://bit.ly/k8s-LWS).
Here is the high-level concept of the LeaderWorkerSet.
To summarise the architecture (a minimal manifest sketch follows this list):
- We have a LeaderWorkerSet that comes with zero or more replicas of a worker group.
- Each worker group is a serving unit comprised of one or more pods clustered together.
- Each worker group comes with a leader pod, managed by a leader StatefulSet (sts). The leader pod is responsible for:
  - Serving as the entrypoint for incoming inference requests.
  - On initial loading, loading the model and farming the parameters out to the workers (at least in the context of vLLM).
  - Coordinating the workers to perform the actual inference.
- Every worker group also has a StatefulSet (sts) of workers that is clustered with its individual leader.
- A LeaderWorkerSet can be scaled up and down by changing the `replicas` field in its spec.
- Each individual worker group can also be "horizontally" scaled by changing the `size` field, where `size = leader + worker count`.
- To minimise network latency (not everyone is on InfiniBand!), the solution is also topology aware: it makes sure the leader and workers that belong to the same worker group are scheduled onto the same topology domain (node/subnet/zone, etc.).
- The failure handling of LWS is very much all-or-nothing: if any pod within a worker group fails, the entire worker group is killed and rescheduled from scratch.
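To make the above concrete, here is a minimal sketch of how these concepts map onto a LeaderWorkerSet manifest. The names, image and topology key are placeholders, and the exclusive-topology annotation name is taken from my reading of the LWS docs; consult the upstream examples for a complete spec.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: inference
  annotations:
    # Keep each worker group within one topology domain (placeholder key).
    leaderworkerset.sigs.k8s.io/exclusive-topology: topology.kubernetes.io/zone
spec:
  replicas: 2                 # number of worker groups
  leaderWorkerTemplate:
    size: 3                   # pods per group: 1 leader + 2 workers
    leaderTemplate:           # pod template for the leader
      spec:
        containers:
        - name: leader
          image: your-inference-image:tag   # placeholder
    workerTemplate:           # pod template for the workers
      spec:
        containers:
        - name: worker
          image: your-inference-image:tag   # placeholder
```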
Deploying LWS Operator
Currently the official approach is to deploy via a yolo `install.sh` script, which is understandable considering the project is fairly bleeding edge.
However this is not particularly operable in a production environment. As a result I forked the repo and "productionised" the operator's kustomize configuration into a helm chart: https://github.com/jingkaihe/lws/tree/main/charts/lws. Obviously I didn't hand-craft the helm chart manually; instead it's automatically generated by some good old `make config`. All credit to the awesome helmify script.
Currently you can pretty much get a productionised deployment via the following Terraform configuration:
variable "service_name" {
type = string
default = "lws-webhook-service"
}
variable "image" {
type = object({
repository = string
tag = string
})
default = {
repository = "registry.k8s.io/lws/lws"
tag = "v0.3.0"
}
}
resource "kubernetes_namespace" "lws_system" {
metadata {
name = "lws-system"
}
}
resource "helm_release" "lws" {
name = "lws"
repository = "oci://ghcr.io/jingkaihe"
chart = "lws"
version = "0.1.0"
namespace = kubernetes_namespace.lws_system.metadata[0].name
create_namespace = false
max_history = 3
set {
name = "installCRDs"
value = "true"
}
values = [
yamlencode({
fullnameOverride = "lws"
manager = {
image = {
repository = var.image.repository
tag = var.image.tag
}
resources = {
requests = {
cpu = "500m"
memory = "512Mi"
}
}
}
}),
]
}
Deploying a LWS
The actual LWS CRD is pretty gruesome, so this article won't go into too much detail on it. Here is an example of how to deploy a LWS powered by vLLM, with the distributed runtime managed by Ray: https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm.
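Roughly, the pattern is that the leader starts a Ray head and launches vLLM's OpenAI-compatible server with tensor parallelism spanning the whole group, while the workers simply join the Ray cluster. The sketch below is a simplification of that idea rather than the upstream manifest verbatim: the image tag, model name and GPU counts are placeholders, the real example wraps these commands in a script that waits for all workers to join before starting vLLM, and I'm assuming LWS injects the leader's address as the `LWS_LEADER_ADDRESS` environment variable (double-check the variable name against the LWS docs).

```yaml
# Simplified container specs for the leaderTemplate / workerTemplate.
leaderTemplate:
  spec:
    containers:
    - name: vllm-leader
      image: vllm/vllm-openai:latest        # placeholder tag
      command: ["/bin/sh", "-c"]
      args:
        # Start the Ray head, then serve the model with TP across the group
        # (the upstream example waits for the Ray cluster to reach full size first).
        - >
          ray start --head --port=6379 &&
          python3 -m vllm.entrypoints.openai.api_server
          --model meta-llama/Meta-Llama-3.1-70B-Instruct
          --tensor-parallel-size 16
      resources:
        limits:
          nvidia.com/gpu: "8"
workerTemplate:
  spec:
    containers:
    - name: vllm-worker
      image: vllm/vllm-openai:latest        # placeholder tag
      command: ["/bin/sh", "-c"]
      args:
        # Join the Ray cluster formed by the leader and block.
        - "ray start --address=$(LWS_LEADER_ADDRESS):6379 --block"
      resources:
        limits:
          nvidia.com/gpu: "8"
```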
Operational Caveats
Black hole the metrics endpoint
By default the vLLM OpenAI-compatible endpoint exposes the Prometheus metrics right at the `/metrics` path. You probably want to black-hole external access to it. You can do so by slapping `"nginx.ingress.kubernetes.io/server-snippet" = "location = /metrics { return 404; }"` onto your ingress annotations. The example uses nginx, but the same idea can be applied to other ingress controllers.
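If you define the Ingress directly in YAML rather than via Terraform, the same annotation looks roughly like this; the host, service name and port are placeholders for your own setup.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm
  annotations:
    # Return 404 for external requests to the Prometheus endpoint.
    nginx.ingress.kubernetes.io/server-snippet: |
      location = /metrics { return 404; }
spec:
  ingressClassName: nginx
  rules:
  - host: llm.example.com              # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-leader          # placeholder: service fronting the leader pods
            port:
              number: 8000
```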
Lengthen the timeout for the incoming requests
Unlike conventional HTTP request/response traffic, the inference response is usually streamed back to the client in chunks. As a result you probably want to increase the timeouts on the ingress end with something that looks like:
```hcl
annotations = {
  "kubernetes.io/ingress.class"                              = "nginx"
  "nginx.ingress.kubernetes.io/rewrite-target"               = "/"
  "nginx.ingress.kubernetes.io/proxy-read-timeout"           = "60"
  "nginx.ingress.kubernetes.io/proxy-send-timeout"           = "60"
  "nginx.ingress.kubernetes.io/proxy-next-upstream-timeout"  = "60"
}
```
vLLM parameters
Here are some parameters you might want to tweak (a sketch of how they fit together follows the list):
- `--tensor-parallel-size` should be more or less equal to `lws_size * number-of-gpus`, and it must be a power of 2 (2, 4, 8, 16, etc.).
- `--max-model-len` can be set to a smaller number if you are vRAM-poor. It results in a reduced context window but saves vRAM.
- `--enable-chunked-prefill=false` can also be used if you constantly hit OOM.
- `--swap-space` is pretty hard to tune. It sets how much CPU RAM (4GB per GPU by default) vLLM may use as swap space for the KV cache; you can set it higher to relieve vRAM pressure, but do make sure you don't run the node into CPU OOM.
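Put together, the flags might look roughly like this on the serving container; the model name and numbers below are illustrative, not recommendations.

```yaml
# Illustrative values only -- tune them for your own models and fleet.
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
  - "--model=meta-llama/Meta-Llama-3.1-70B-Instruct"
  - "--tensor-parallel-size=8"          # roughly lws size * GPUs per pod
  - "--max-model-len=32768"             # smaller context window, less vRAM
  - "--enable-chunked-prefill=false"    # if you constantly hit OOM
  - "--swap-space=16"                   # GiB of CPU RAM per GPU used as KV-cache swap
  - "--gpu-memory-utilization=0.9"      # the default; see Observability below
```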
Observability
The Prometheus metrics are available out of the box for scraping. https://github.com/vllm-project/vllm/blob/main/examples/production_monitoring/grafana.json should also give you some ideas on how to visualise the data.
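If you happen to run the Prometheus Operator (an assumption on my part, not something the stack above requires), a PodMonitor along these lines is one way to scrape the vLLM pods; the labels and port name are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm
spec:
  selector:
    matchLabels:
      app: vllm                # placeholder: whatever labels your LWS pods carry
  podMetricsEndpoints:
  - port: http                 # placeholder: the port name serving /metrics
    path: /metrics
```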
Additionally you probably also want to measure GPU usage. If you are using NVIDIA GPUs you can use the `dcgm-exporter`. That being said, what metrics can be scraped is a little bit obscure. Personally I ended up with a combination of `kubectl port-forward` against the `dcgm-exporter` DaemonSet endpoint, reverse engineering the metrics, and this doc in the mix.
One thing you will notice is that the GPU memory utilisation of the vLLM nodes always stays at 90-ish percent. This is because `--gpu-memory-utilization` is set to 0.9 by default, and the model weights and KV cache are pretty much preallocated. It sorta works like the JVM's `-Xmx` and `-Xms`, where the memory is preallocated up front.
Big artifact loading
As this article has previously suggested, naively loading the models and vLLM container images from the Hugging Face hub or a container registry is pretty much a no-go.
Here we highlight a few techniques you can apply to speed up the loading process.
GKE Image Streaming
In our case we use the GKE image streaming feature to speed up the vLLM container image loading. According to [this doc], this is how it works behind the scenes:
With Image streaming, GKE uses a remote filesystem as the root filesystem for any containers that use eligible container images. GKE streams image data from the remote filesystem as needed by your workloads. Without Image streaming, GKE downloads the entire container image onto each node and uses it as the root filesystem for your workloads.
While streaming the image data, GKE downloads the entire container image onto the local disk in the background and caches it. GKE then serves future data read requests from the cached image.
When you deploy workloads that need to read specific files in the container image, the Image streaming backend serves only those requested files.
LLM model loading
The biggest loading overhead comes from loading the LLM model weights into the container itself. We managed to speed up the process as follows:
- Preprovision a PVC with a generous size that can at least fit the model weights. Practically we managed to fit the 70b model into a 200GB PVC, which provides 1GB/s+ model loading speed into CPU RAM.
- Run a Kubernetes Job with the PVC mounted into the job container, download the model weights into the mounted directory, and towards the end take a snapshot. The process looks like this:

```bash
pip install -U "huggingface_hub[cli]" "transformers[torch]";
apt-get update && apt-get install -y curl;
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
install kubectl /usr/local/bin/;
huggingface-cli download $${MODEL_NAME} --exclude "original/*";
python -c "from transformers.utils.hub import move_cache; move_cache()";

cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${each.key}-snapshot-$(date +%s)
spec:
  volumeSnapshotClassName: llm-snapshot
  source:
    persistentVolumeClaimName: llm-snapshotter-${each.key}
EOF
```

(The `${each.key}` and `$${MODEL_NAME}` references are Terraform template interpolations from our own configuration.) Note you absolutely need the `python -c "from transformers.utils.hub import move_cache; move_cache()"` step, otherwise the cache migration happens at model-serving time instead, which causes errors since the mounted volume is not writable at that point.
- When you want to serve the model, restore the snapshot into a PVC with `storage_class_name: standard-rwo` and `access_modes: ReadOnlyMany`, so that it can be mounted by multiple pods (a rough manifest sketch follows this list). Note the provisioned IO throughput will be shared by all the pods, so you might want to give it a generous size depending on the size of your fleet.
- Voilà! In your LWS resource definition you can now reference the PVC, so that you can load the model directly from disk without needing to download it over the wire.
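For reference, restoring the snapshot into a read-only PVC looks roughly like the sketch below; the claim and snapshot names are placeholders, and the storage class follows the `standard-rwo` choice mentioned above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-weights                    # placeholder name
spec:
  storageClassName: standard-rwo
  accessModes:
    - ReadOnlyMany                     # mountable by every pod in the worker group
  dataSource:
    name: llama3-70b-snapshot          # placeholder: the VolumeSnapshot taken by the job
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 200Gi                   # size it for the throughput you need, not just the weights
```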
A few other options were also considered, but not chosen in the end:
- A FUSE mount with GCS/S3 as the backend: in practice we noticed it is painfully slow, with approx. 100MB/s read speed against GCS. In theory it can be improved with Google Private Service Connect, but we have not verified it.
- NFS-based solutions such as AWS EFS or Google Filestore. These were not picked because 1) NFS downtime would have a knock-on effect on the entire system, and 2) cloud-based NFS solutions generally come with a hefty bill, especially since IOPS is often tied to the provisioned volume size.
Via this approach we managed to reduce the Llama 3 70B load time from 60 minutes (via GCS) to 3 minutes, measured from the GPU node being ready to the vLLM cluster being entirely up and running.
This guide from Anyscale provides an even more mind-blowing result; however it is not applicable to us since we are not using Anyscale Endpoints.
Final Thoughts
In this article we showed that with the SIG LWS project it is possible to serve LLMs in a multi-node manner.
Self-hosting an LLM is an appealing option when:
- Proprietary solutions such as OpenAI or Anthropic cannot be used due to privacy, data sensitivity or regulatory concerns,
- Your LLM inference usage has reached a critical mass where running your own model is more cost-effective, in terms of cost per 1,000 tokens, than using third-party inference-as-a-service.
That being said, it also comes with a few downsides, namely:
- Complexities in terms of infrastructure and multi-pod gang scheduling on top of Kubernetes.
- Unless your usage reaches a certain critical mass, the cost per token is orders of magnitude higher than using third-party inference services. You can mitigate this by scaling down to zero, but that comes at a significant cost in time-to-first-token, which may or may not be tolerable depending on the type of workload and your end customers.
- You are likely to be out-competed by hyper-specialised third-party services if managing LLMs is not your core competency.
- The model you are serving is likely to be inferior to OpenAI's.