Over the years, we have observed several common issues with Kubernetes secret management among our clients. Very often the problems start even before the secrets reach the application that needs them. A lack of strong company-wide security policies results in plain-text secrets being passed through emails or private chats, which eventually find their way into service configuration files on GitHub in plain text (sounds familiar?!). Kubernetes Secrets can solve part of the problem; however, they are far from a perfect solution due to:
- cluster-only scope; as a result, only apps on the Kubernetes cluster can access secrets
- Kubernetes secrets being stored in etcd base64-encoded, which is effectively plain text; encrypting them at rest adds extra complexity and maintenance
One way to mitigate those issues is to store secrets in an external secret manager and implement a secure mechanism that allows both developers and Kubernetes workloads to access them. At CECG, we recognise that choosing the right mechanism can be challenging. Many factors come into play, such as cost, environment, the maturity of the tenants within the organisation, and security profile. In this article, we review the four most common mechanisms (all free!) by comparing their characteristics in the context of HashiCorp Vault as the external secret manager.
External Secrets Operator
External Secrets Operator (also referred to as ESO) synchronises secrets from a variety of external secret management systems into native Kubernetes secrets. It uses intuitive Custom Resource Definitions (CRDs) to configure access to the underlying secret storage and to manage synced secrets. The operator itself is a single Kubernetes deployment with appropriate cluster-wide privileges to update secrets. When a custom resource (CR) describing a secret is created, the operator loads the configuration for the corresponding external secret store and fetches that secret. Once fetched, the secret is then written to the specified namespace and can be consumed by pods via the native Kubernetes secret mechanism. From that point, access control to the secrets is handled by Kubernetes RBAC.
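To illustrate the flow, a minimal `ExternalSecret` CR might look like the following sketch (the store name, namespace, Vault path and key names are illustrative assumptions; field names follow the ESO `v1beta1` API):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: my-app            # hypothetical tenant namespace
spec:
  refreshInterval: 1h          # how often the operator re-fetches from Vault
  secretStoreRef:
    name: vault-backend        # a SecretStore configured for the Vault KV2 engine
    kind: SecretStore
  target:
    name: db-credentials       # the native Kubernetes Secret to create/update
  data:
    - secretKey: password      # key within the resulting Kubernetes Secret
      remoteRef:
        key: app/db            # path within the Vault KV2 mount
        property: password     # field within the Vault secret
```

The operator resolves `secretStoreRef`, fetches the value from Vault, and writes it to the `db-credentials` Kubernetes Secret in the same namespace; pods then consume it like any other native secret.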
Pros
- supports Token, AppRole, Kubernetes, LDAP, JWT/OIDC authentication methods
- small resource footprint, as it requires three single-pod deployments (one for the operator, one for the webhook, and one for its certificate controller)
- allows a refresh interval to be specified per secret; a refresh can also be triggered manually
- unavailability of the operator doesn’t affect pods that use secrets already present in the cluster, but only how fresh the secrets are
- status available for each CR with the last-synced time
- provides a bunch of features such as templating to transform secret data, secret generators and various providers (even including one to fetch secrets from another Kubernetes cluster)
- exposes Prometheus metrics on secret syncs, errors, and the status of each secret
- open-sourced, free and can be contributed to (something to do for long winter evenings!)
- due to its loosely coupled architecture, the provider can be replaced with relatively little effort for tenants
- reduced load on the Vault API, as secrets are synced per custom resource (instead of per pod, as we’ll see with alternative mechanisms)
Cons
- let’s start with a big one: the Vault provider only supports the KV version 2 secrets engine, although the recent ESO addition of Generators opens the door to other secret engines
- secrets are stored in two places which increases the risk of exposure
- CR status doesn’t show which secret version is used by each pod, only when it was last synced
- all your workloads must run on Kubernetes as it uses the native Kubernetes secret mechanism (obvious, but often missed requirement!)
- audit log for secret access in two places (Vault and Kubernetes)
- not the best documentation (light drawback as this is the case for many if not most of the open-sourced products)
Kubernetes Secrets Store CSI Driver
The Kubernetes Secrets Store CSI Driver allows Kubernetes to fetch secrets from external sources and make them available to pods. Its main feature is mounting retrieved secrets into pods as volumes; however, it can also sync secrets into Kubernetes secrets. This solution requires the installation of the CSI driver and the Vault provider, which acts as an interface to the underlying storage provider. It uses a CRD to define the provider and the secrets to retrieve. When a pod requests a CSI volume to be mounted, the driver communicates with the provider, which in turn requests the secret specified in the CR using the pod’s identity. Once retrieved, the secret is written to the pod’s mounted volume or, when configured, to a Kubernetes secret.
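As a rough sketch, this pairs a `SecretProviderClass` with a CSI volume in the pod spec (the Vault address, role, paths and names below are illustrative assumptions):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: vault-db-creds
spec:
  provider: vault
  parameters:
    vaultAddress: "https://vault.example.com:8200"  # hypothetical Vault endpoint
    roleName: "my-app"                              # Vault Kubernetes auth role
    objects: |
      - objectName: "db-password"
        secretPath: "secret/data/app/db"
        secretKey: "password"
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: secrets
          mountPath: /mnt/secrets   # secret appears as /mnt/secrets/db-password
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: vault-db-creds
```

Note that the pod itself only references the `SecretProviderClass` by name; the driver and provider handle authentication and retrieval at mount time.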
Pros
- all Vault secret engines supported
- supports secret rotation with a configurable global rotation period; updating the source secret updates the value in the mounted path. If secret rotation is disabled, a rolling deployment loads the new secret value
- secret CR status shows the version of the secret used by each pod
- exposes Prometheus metrics on the number of volume mounts, syncs, errors and duration of sync
- due to a provider being behind an interface, it allows switching underlying secret storage providers without modifying pods or the CSI driver (only CR update required)
- fairly good documentation with clear examples
Cons
- only supports Kubernetes and JWT authentication methods
- as of the time of writing, syncing to a Kubernetes secret and secret rotation are alpha features with known limitations
- mounting secrets as environment variables is not directly supported; it requires syncing to a Kubernetes secret first
- although running pods are unaffected, pods using the CSI driver are unable to start when either the driver or the provider is unavailable or having connectivity issues (which can be problematic when your app could still offer some functionality even though the database cannot be accessed)
- a bit more costly than other solutions, as it requires one privileged DaemonSet for the driver and one for each provider (however, not to worry, your company Christmas party shouldn’t be cancelled!)
- compared to other solutions, it requires additional maintenance to keep up to date and maintain its components' compatibility (driver, providers and their CRDs)
- no support for templating, potentially adding some complexity to tenant workloads to render their secrets
- if you want to enable Kubernetes secret sync, you’ll need to mount the secret into a pod first. At that point, your secret in Kubernetes is tied to the pod lifecycle: when the pod is terminated, the Kubernetes secret is deleted
- cluster-aware workloads: because the auth method is restricted to Kubernetes, each pod has to reference the cluster-specific auth method name in its manifest (which in turn couples the manifest to the cluster)
- the driver requires permissions to mount kubelet hostPath volumes with the ability to view pod service account tokens; therefore, stronger security measures are required to protect against application vulnerabilities such as path traversal attacks
Vault Secrets Operator
The Vault Secrets Operator (VSO) syncs secrets between Vault and Kubernetes secrets in a specified namespace. The secrets are still managed by Vault, but can be accessed natively by Kubernetes workloads. It uses CRDs to maintain access to Vault and manage Kubernetes secrets. The operator watches for changes to the source secrets in Vault and writes them to Kubernetes through the API. Similarly to ESO, it requires cluster-wide privileges to create secrets and service account tokens, and to roll deployments.
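As an illustration, a minimal `VaultStaticSecret` CR might look like this (the auth reference, mount, path, and deployment name are illustrative assumptions; field names follow the VSO `v1beta1` API):

```yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: db-credentials
  namespace: my-app
spec:
  vaultAuthRef: vault-auth    # a VaultAuth resource defining how to authenticate
  mount: kvv2                 # Vault KV2 mount name
  type: kv-v2
  path: app/db                # secret path within the mount
  refreshAfter: 60s
  destination:
    name: db-credentials      # the Kubernetes Secret to create
    create: true
  rolloutRestartTargets:      # roll the deployment when the secret changes
    - kind: Deployment
      name: my-app
```

The `rolloutRestartTargets` field is what enables the operator to restart workloads on secret change, removing the need for reload logic in tenant applications.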
Pros
- all Vault secret engines supported
- all Vault authentication methods supported
- unavailability of the operator doesn’t affect the pod lifecycle, but only how fresh the secret might be
- constant secret reconciliation: on a change to a Vault secret, the Kubernetes secret is automatically updated. The operator doesn’t stop there: it can also trigger a deployment rollout on secret change, so no reload logic is needed in tenant workloads
- exposes Prometheus metrics endpoint (which are quite extensive such as Kubernetes-Vault request latency, number of secrets managed, sync stats, etc.)
- allows a refresh interval to be specified per secret; a refresh can also be triggered manually
- provides a bunch of Vault features such as templating to transform secret data, and caching
- reduced load on Vault as secrets are synced per CRD instead of per pod
- small resource requirements
Cons
- requires an account with cluster-wide permissions to manage secrets and rollout deployments
- secrets are stored in two places which increases the risk of exposure
- all your workloads must run on Kubernetes as it uses Kubernetes secrets
- audit log for secret access in two places (Vault and Kubernetes)
Vault Agent
Vault provides a Kubernetes mutating webhook controller called the Vault Agent Sidecar Injector. As the name suggests, it adds a Vault Agent sidecar container to pods when it discovers a specific annotation in the pod manifest during pod creation. The sidecar then uses the pod identity to fetch the requested secrets directly from Vault during pod initialisation. Once the pod is running, the agent runs alongside the application container and provides further capabilities such as secret renewal, rotation, templating and caching.
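For a sense of the tenant-facing interface, the injection is driven entirely by pod annotations, roughly like this (the role name, secret path and image are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "my-app"   # Vault Kubernetes auth role
    # renders the secret to /vault/secrets/db-creds inside the pod
    vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/app/db"
spec:
  containers:
    - name: app
      image: my-app:latest
```

The webhook mutates this pod at admission time, adding the agent containers; the application simply reads the rendered file from the shared in-memory volume.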
Pros
- all Vault authentication methods supported
- all Vault secret engines supported
- secrets are written straight to the pod’s volume without the need to sync them to Kubernetes secrets
- uses in-memory tmpfs volumes to mount secrets
- has templating options to render the secrets
- easier for teams to debug config issues as authentication issues are visible in their Vault Agent sidecar logs
- secret lifetime is tied to the lifetime of the pod
- useful features such as automatic renewal, secret rotation and caching
- ability to customise Vault Agent behaviour for individual pods through pod annotations, e.g. disabling secret reconciliation by turning off the sidecar container after the pod starts
Cons
- limited support for exposing secrets as environment variables (achievable, but only through templating, and the application would have to reload on change or be restarted)
- dynamic secrets feature requires the pod to reload the secret to load a new value
- when using the rotation/caching feature, each pod has to run an additional Vault Agent sidecar that consumes CPU and memory; however, the sidecar’s resources can be tuned globally or per pod via annotations, and for pods that don’t need this feature, the sidecar can be disabled through annotations
- a bit more complex set-up due to the required connectivity between the pod’s agent and the Vault server
- higher maintenance/support effort as each pod is running a side container
- no out-of-the-box ability to view a secret version being used in the pod without running commands within the container
- similar to the CSI driver, running pods are unaffected; however, pods are unable to start when the Vault server or the webhook controller is unavailable
Top Features Comparison
| | Vault Agent | Vault Secrets Operator | Kubernetes Secrets Store CSI Driver | External Secrets Operator |
| --- | --- | --- | --- | --- |
| Secret Engines | All | All | All | KV only via Vault provider (others supported through Generators) |
| Authentication Methods | All | All | Kubernetes and JWT | Token, AppRole, Kubernetes, LDAP, JWT |
| Mount as Volume | Yes | No | Yes (ephemeral disk, via hostPath volumes) | No |
| Sync to Kubernetes Secrets | No | Yes | Yes (optional, requires pod mount) | Yes |
| Env Variable Within Pod | Yes (via templating) | Yes | Yes (only when synced to Kubernetes secrets) | Yes |
| Helm Chart | Yes | Yes | Yes | Yes |
What to pick?
Choosing the right approach is highly dependent on the size of the platform, the platform architecture (multi-cloud, multi-cluster with hundreds of tenants vs small single cloud, few clusters, few tenants), its requirements and business constraints.
External Secrets Operator and Vault Secrets Operator are both great choices when you’re aiming to run all your workloads only on Kubernetes and don’t have any third-party tools running in the cluster that require Vault secrets. The fact that your workloads are not coupled to Vault in any shape or form (as you’ll be using native secrets) gives you flexibility if you ever need to replace the underlying secret store. It’ll also reduce the blast radius in case of ESO/VSO outage/malfunction as in most cases, your pod lifecycle will not be affected. The above benefits, coupled with great observability features make both tools strong contenders when aiming for a native way to access external secrets through Kubernetes. However, Vault Secrets Operator offers a full range of secret engines out of the box at the expense of stronger coupling to Vault. In our experience, those tools do not change very often, whereas the dynamic secrets feature with deployment rollout (that only VSO supports) is likely to be desired by existing or future onboarding teams.
The CSI driver and Vault Agent can both mount secrets as volumes. You can still keep a fair separation between the secret store and tenant awareness of the secret mounting mechanism. The CSI Driver might appear to have an advantage of rendering secrets into Kubernetes secrets, however, this feature is not as useful as you think, as the synced secret is tightly coupled to the mounting pod lifecycle. Therefore, when ignoring syncing features, Vault Agent is a stronger contender as it offers much better capabilities such as templating, secret rotation and caching. The ability to customise the agent functionality by annotations to suit individual workload needs makes it a good, future-proof and flexible solution that can grow in sync with your platform.
The bottom line is that when you need to access Vault secrets natively through Kubernetes, Vault Secrets Operator becomes the more compelling option. When you don’t, or even have a requirement not to sync secrets into Kubernetes, Vault Agent is the leading mechanism. A short statement like this won’t do the comparison full justice, but overall, considering the options and their characteristics, it is a fair generalisation.
What about the tenants' efforts?
While picking the right solution, we (platform engineers) cannot forget our tenants and must also consider the complexity and effort required of them to use Vault secrets in their workloads. In our experience, there is very often a trade-off between the amount of effort put in by the platform team and the ease of use (including required awareness of internal workings) for tenants. It makes sense for the organisation to encapsulate the complexity of each integration as much as possible within the platform team(s). In the end, this is one of the main benefits of having developer platforms in the first place. In that department, Vault Secrets Operator is more favourable, simply due to its ability to deliver secrets through the native Kubernetes mechanism. Basic tenant needs such as exposing environment variables require a bit more effort with Vault Agent, as the responsibility for exposing them is shifted to tenants. In addition, with VSO you might get fewer midnight support calls too, as pod restarts are not affected by operator or Vault failures, in contrast to Vault Agent.
Unfortunately, those benefits come at a cost. Designing a multi-tenancy, secure, isolated, least-privileged secret management sync mechanism with VSO requires a bit more effort as it needs careful consideration when designing isolation among tenants' secrets. Vault Agent doesn’t impose those challenges due to direct secret to pod mapping but at the expense of the additional complexity being placed on tenants. Both drawbacks can be mitigated by various techniques: platform helm charts, additional components such as policy agents, etc. However, keep in mind that those translate to higher costs and increased maintenance efforts for the platform team.
Summary
Implementing a secret management integration mechanism is not a trivial task. It is important to choose the most suitable mechanism to ensure your tenants have a clear, secure way of storing and accessing secrets without placing an unnecessary cognitive load on them. In our experience, given the business constraints, when the right balance is struck, fast and high adoption usually follows, which in the world of security is an absolute necessity (and highly rewarding for us, hardworking platform engineers!).