How to Take Prometheus Planet-Scale: Massively Large Scale Metrics Deployments
Observability at eBay has been on an exponential growth curve: a time-series ingest rate of 2M/sec in 2017 is now roughly 40M/sec, with close to three billion active time series. Our current Cortex-inspired architecture builds sharding and clustering on top of the Prometheus TSDB. Sharding and replicating tenants' data across centralized clusters is relatively simple, but large clusters with growing cardinality become less useful as query latencies degrade considerably. In 2020, Google published a paper on its time-series database Monarch, dubbed a planet-scale TSDB. The paper gave us useful hints on how we could decentralize our installations and go fully planet-scale. We started with a prototype that federates queries to TSDBs in different cities; that work now lets us deploy our TSDBs anywhere using Kubernetes operators and Prometheus. This session covers the planet-scale architecture of our metrics platform, how GitOps has helped us absorb the complexity of massive deployments, and more.
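To make the federation idea concrete, here is a minimal sketch of a query broker that fans a PromQL query out to regional Prometheus-compatible TSDBs and gathers the results. It is not eBay's implementation: the endpoint URLs are hypothetical, and the flat concatenation of result vectors stands in for the deduplication and aggregation push-down a real broker would do. It uses only the standard Prometheus HTTP query API (/api/v1/query).

```go
// Hypothetical sketch of planet-scale query federation: one broker, many
// regional TSDBs, same PromQL query, merged results.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// queryResult mirrors the relevant part of a Prometheus /api/v1/query
// response carrying an instant-vector result.
type queryResult struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string            `json:"resultType"`
		Result     []json.RawMessage `json:"result"`
	} `json:"data"`
}

// federate sends the query to every region concurrently and concatenates
// the per-region sample vectors. A production broker would deduplicate
// series and push aggregations down to the regions; this sketch only gathers.
func federate(regions []string, promql string) []json.RawMessage {
	client := &http.Client{Timeout: 10 * time.Second}
	var (
		mu     sync.Mutex
		merged []json.RawMessage
		wg     sync.WaitGroup
	)
	for _, base := range regions {
		wg.Add(1)
		go func(base string) {
			defer wg.Done()
			resp, err := client.Get(base + "/api/v1/query?query=" + url.QueryEscape(promql))
			if err != nil {
				return // in practice: surface partial-failure metadata to the caller
			}
			defer resp.Body.Close()
			var qr queryResult
			if json.NewDecoder(resp.Body).Decode(&qr) != nil || qr.Status != "success" {
				return
			}
			mu.Lock()
			merged = append(merged, qr.Data.Result...)
			mu.Unlock()
		}(base)
	}
	wg.Wait()
	return merged
}

func main() {
	// Hypothetical regional endpoints; in a real deployment these would be
	// discovered dynamically, not hard-coded.
	regions := []string{
		"http://tsdb.us-west.example.com",
		"http://tsdb.eu-central.example.com",
	}
	series := federate(regions, `sum(rate(http_requests_total[5m]))`)
	fmt.Printf("merged %d series from %d regions\n", len(series), len(regions))
}
```

The key design point the abstract alludes to is that each regional TSDB stays autonomous and close to its data, while the broker layer absorbs fan-out, so adding a new city is a deployment change rather than a re-sharding of a central cluster.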