When we first launched Acme, our event ingestion pipeline could handle about 10,000 events per second. That was plenty for our early customers. But as we grew, we started hitting bottlenecks that forced us to rethink our entire approach.
The first thing we learned is that vertical scaling has hard limits. No matter how beefy we made our ingestion servers, we'd eventually hit a ceiling. So we moved to a horizontally scalable architecture based on Apache Kafka and a custom partitioning scheme that distributes load evenly across our cluster.
The second breakthrough came from rethinking our storage layer. We moved from a traditional relational database to a columnar store optimized for time-series analytics. This alone gave us a 10x improvement in query performance and dramatically reduced our storage costs.
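The intuition behind that 10x: a columnar layout lets an analytical query scan only the columns it needs, in contiguous memory, instead of touching every field of every row. This toy sketch (synthetic data, not Acme's schema) shows the layout difference:

```python
from array import array

# Row-oriented: one dict per event; an aggregate must touch every
# field of every row, even the ones it doesn't use.
rows = [{"ts": t, "user": f"u{t % 7}", "latency_ms": t % 50}
        for t in range(10_000)]

# Column-oriented: one compact, contiguous array per field; an
# aggregate reads just the columns it needs.
ts = array("q", (r["ts"] for r in rows))
latency_ms = array("q", (r["latency_ms"] for r in rows))

# Typical time-series query: mean latency over a time window,
# scanning only the ts and latency_ms columns.
lo, hi = 2_000, 5_000
window = [latency_ms[i] for i in range(len(ts)) if lo <= ts[i] < hi]
mean_latency = sum(window) / len(window)
```

Real columnar stores add compression and vectorized execution on top of this layout, which is where most of the storage savings come from: values within a column are similar, so they compress far better than interleaved row data.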
Today, our pipeline processes over 1 million events per second at peak load, with a p99 latency of under 50ms from ingestion to queryability. We're proud of what we've built, and we're excited to share the technical details with the community.