In our work with a client, we encountered a challenge with their multi-tenant Kubernetes platform. The platform was designed to provide a flexible environment where each tenant could independently manage their own services and infrastructure. As part of this setup, tenants were encouraged to create and maintain their own monitoring stacks using Prometheus and Alertmanager.
While this approach offered tenants a high degree of autonomy and allowed them to tailor monitoring to their specific needs, it also introduced several issues:
Complexity in Management
Each tenant was responsible for deploying, configuring, and maintaining their own instance(s) of Prometheus and Alertmanager. This led to a lack of standardization and increased the operational burden on tenants, many of whom did not have deep expertise in running a monitoring stack.
Resource Duplication
The proliferation of individual Prometheus instances multiplied resource consumption, as each tenant’s monitoring stack consumed its own CPU, memory, and storage. This led to inefficient use of the cluster’s overall resources.
Difficulty in Aggregating Metrics
With multiple isolated instances of Prometheus, it became challenging to aggregate metrics across tenants or gain a holistic view of the platform’s health. This fragmentation made it difficult for the client’s operations team to oversee the entire Kubernetes environment and identify potential issues affecting multiple tenants.
Scalability Concerns
As the number of tenants grew, the complexity and resource demands of managing separate monitoring stacks increased, raising scalability concerns for the platform. This was further compounded by the fact that Prometheus is not designed to scale horizontally.
This situation highlighted the need for a more centralized and efficient approach to monitoring, one that could retain some level of tenant autonomy while addressing the above challenges. To that end, we evaluated several metrics solutions that could provide a more streamlined and scalable monitoring experience for the client’s multi-tenant Kubernetes platform.
Evaluation Criteria
When evaluating potential technologies, we considered the following criteria:
Performance: How well does it handle monitoring and querying large volumes of metrics? What is the latency for data ingestion and retrieval?
Scalability: Can it handle growth in data and user base without significant refactoring?
Ease of Integration: How well does it integrate with the existing tech stack used by the tenants?
Multi-Tenancy: How well does it support multi-tenancy? For example, does it provide data isolation and protection against noisy neighbors?
Cost: Is it cost-effective in the long run? What are the costs associated with maintenance and scaling?
Community Support: Does it have an active community and robust documentation?
Options Considered
1. Prometheus + Thanos
Thanos is a set of components that extends Prometheus; its sidecar is deployed alongside each Prometheus server and provides unlimited retention (by storing metrics in object storage), global querying, and downsampling.
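For context, the sidecar uploads metric blocks to a bucket described by a small object storage config passed via `--objstore.config-file`; a minimal sketch for S3 (the bucket name, endpoint, and credentials below are placeholders):

```yaml
# objstore.yaml — handed to the Thanos sidecar with --objstore.config-file
# (bucket, endpoint, and credentials are placeholders)
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.eu-west-1.amazonaws.com
  access_key: <access-key>
  secret_key: <secret-key>
```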
Pros
Familiarity and Broad Adoption: Prometheus is already widely used within the platform, and Thanos extends its capabilities for long-term storage and global querying. In fact, some tenants were already using this setup and were familiar with the components.
Cost-effective Retention: Thanos provides a cost-effective solution for long-term storage by leveraging object storage backends like S3 or GCS.
Global Querying: Thanos enables querying across multiple Prometheus servers, allowing for a more centralized view of metrics.
Cons
Operational Overhead: Managing and orchestrating the required components, such as Prometheus servers, Thanos sidecars, a Thanos query layer, compactors, and receivers, can be complicated, especially in a dynamic Kubernetes environment.
Lack of Scalability: While Thanos provides a scalable solution for long-term storage, it still relies on individual Prometheus servers for ingestion and recent data, which can become a bottleneck as the number of tenants and metrics grows.
Resource Consumption: Running multiple Prometheus servers and Thanos components can lead to high resource consumption, especially in a multi-tenant environment.
Summary
Even though Prometheus + Thanos would provide a flexible monitoring solution and would remove some of the scalability issues we were facing (especially around metrics storage and querying), its reliance on Prometheus, which is not highly available by itself, meant that it could still become a bottleneck. This solution would also mean that tenants continue to shoulder the weight of maintaining their own Prometheus stacks. For these reasons, we did not pick this option.
2. VictoriaMetrics (VM)
VictoriaMetrics is a time-series database that has gained popularity as an alternative to Prometheus for storing and querying metrics, especially in environments where scalability and storage efficiency are critical.
Pros
Scalable: Unlike Prometheus, which is single-node by default, VictoriaMetrics can run in a clustered mode, enabling horizontal scaling without needing additional components.
Easy Setup: Runs as a single binary and can ingest millions of samples per second, making it suitable for high-frequency metrics.
Data Compression: Offers optimized storage with high compression ratios, resulting in lower disk space usage compared to many alternatives.
Multi-Tenancy: It has built-in multi-tenancy support, allowing multiple teams or tenants to use the same instance while keeping data separate (see the sketch after this list).
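To illustrate the built-in multi-tenancy: in the clustered version, the tenant is identified by an account ID encoded in the ingestion and query URL paths. A sketch of a Prometheus remote_write section pointing at vminsert (the Service name and tenant ID 42 are assumptions for illustration):

```yaml
# prometheus.yml fragment — tenant 42's samples are kept separate
# from every other tenant's by the account ID in the URL path
# (Service name and tenant ID are placeholders)
remote_write:
  - url: http://vminsert.monitoring.svc:8480/insert/42/prometheus/api/v1/write
```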
Cons
New Technology: VM is relatively new and does not have the same breadth of integrations or as large a community knowledge base as Prometheus.
Premium Multi-Tenancy Features: Certain multi-tenancy features, such as VMAlert management and tenant rate limiting, are only available in the Enterprise version.
Stateful Components: VMStorage, which is deployed as a StatefulSet in Kubernetes, does not auto-scale; we would need to monitor it and resize it ourselves when needed.
Storage on Disks: VM retains metrics on disk-based block storage, which could rack up costs compared to object storage options.
Summary
Even though VictoriaMetrics seemed like a very robust and scalable solution, we were discouraged by the fact that some of our key multi-tenancy requirements, such as per-tenant alerting and downsampling, were only available in its Enterprise version.
3. Grafana Mimir
Grafana Mimir is an open source project forked from Cortex with reduced complexity and improved scalability; it is a powerful time-series database that provides a scalable solution for monitoring and observability.
Pros
Scalable: It uses a microservices architecture, enabling horizontal scaling by adding more instances of each component (ingester, querier, store-gateway, etc.).
Highly Available: Mimir replicates incoming data (three ways by default) and can spread replicas across different availability zones, providing data redundancy.
Multi-Tenancy: Mimir is designed with multi-tenancy in mind, allowing it to scale well in multi-tenant environments where data needs to be isolated between tenants (see the remote_write sketch after this list).
Data Durability: With redundancy built into its architecture, allowing data to be replicated across multiple instances, Mimir is well suited to situations where data durability and uptime are paramount.
Integration with Grafana: Grafana is already well adopted within the company as the de-facto visualization tool and Mimir integrates seamlessly with it.
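Mimir identifies the tenant via the X-Scope-OrgID HTTP header on both writes and reads, so each tenant’s existing Prometheus can keep its scrape config and simply remote-write under its own ID. A minimal sketch (the gateway hostname and tenant name are placeholders):

```yaml
# prometheus.yml fragment — each tenant remote-writes with its own org ID
# (gateway hostname and tenant name are assumptions for illustration)
remote_write:
  - url: http://mimir-gateway.mimir.svc/api/v1/push
    headers:
      X-Scope-OrgID: tenant-a
```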
Cons
Deployment Complexity: Mimir’s architecture, while scalable, is inherently more complex than simpler single-binary solutions. It requires managing multiple microservices such as ingesters, queriers, and distributors.
Resource Consumption: Due to its microservices-based design, Mimir can be resource-intensive, requiring more memory and CPU for its various components.
Why We Chose Grafana Mimir
After evaluating the options, we ultimately chose Grafana Mimir because:
Scalability: Its capability to scale effortlessly (as showcased in this official blog post by Grafana).
Multi-tenancy: Mimir offers all the native multi-tenancy features we require in its open source version, saving costs by not having to pay for an Enterprise version.
Maturity: Because the Mimir project grew out of already widely adopted products like Cortex and Thanos, it is battle-tested and comes with a significant community and support base.
While there were some trade-offs, such as having to manage a number of components due to its microservices architecture, we believe that Grafana Mimir provides the best balance between scalability and cost. Additionally, its compatibility with Prometheus (use of PromQL, the ability to lift and shift scrape configs) and its familiarity (Grafana has been widely used within the company for visualization) made it an ideal choice for our metrics requirements.
How We Implemented Grafana Mimir
The integration process for Grafana Mimir is still ongoing, and the details would require another blog post; so far, however, we have been very impressed with its easy setup (using the official Helm chart from Grafana), the Grafana dashboards provided, and the documentation.
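For a flavor of the setup, a pared-down values file for the grafana/mimir-distributed chart might look like the following; the bucket names, endpoint, and replica count are placeholders rather than our production values:

```yaml
# values.yaml — minimal sketch for the grafana/mimir-distributed Helm chart
# (bucket names, endpoint, and sizing are placeholders)
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.eu-west-1.amazonaws.com
          region: eu-west-1
    blocks_storage:
      s3:
        bucket_name: mimir-blocks
    ruler_storage:
      s3:
        bucket_name: mimir-ruler

ingester:
  replicas: 3
```

This can then be installed with `helm install mimir grafana/mimir-distributed -f values.yaml` after adding the Grafana Helm repository.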
The early results looked quite promising in terms of ingestion rates and active series for the subset of tenants we have migrated so far.
We also encountered a few challenges, such as the lack of support for Prometheus-style metrics federation, but we were able to overcome them by enabling cross-tenant querying for some tenants (see the sketch below).
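The cross-tenant querying mentioned above relies on Mimir’s tenant federation: once enabled, a single read request can name several tenants in X-Scope-OrgID, separated by `|`. A sketch (the tenant names are placeholders):

```yaml
# Mimir config fragment — allow one query to span multiple tenants
tenant_federation:
  enabled: true
# A Grafana data source (or curl) can then query with, e.g.:
#   X-Scope-OrgID: tenant-a|tenant-b
```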
Conclusion
Choosing the right technology is never easy, especially with so many options available. While Grafana Mimir was the best fit for our needs, it’s important to consider your own project requirements when making similar decisions. We hope this breakdown helps others facing a similar choice. If you have any questions or want to share your experience, feel free to contact us!