As a Technical Account Manager at Percona, I get the privilege of working with some of our largest clients. It is exciting to work where I get to see massive deployments that are pushing current utilization limits. In these environments, however, there are different sorts of challenges that the database teams often face:
- Automation in managing 1000s of servers
- Capacity planning
- Architecting for massively sharded environments
- Operational maintenance of the fleet
While these challenges aren’t unique to large deployments, they become much more complex at that size and frame interesting discussions. The challenges may be familiar, but the impact of fundamental problems is intensely magnified at scale.
You’ve likely had a bad query that wasn’t using an index make it into production before: the pager goes off at 2 am, 30 minutes after a release, because one of your servers has a spike in CPU. You crawl out of bed, connect to the server, see 45 copies of the same query in the processlist, and immediately spot the issue. It’s a quick fix: you run the ALTER to add the index, the server returns to normal, and you’re back in bed by 2:15 am.
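That 2 am fix might look something like the following sketch (the table, column, and index names here are hypothetical, purely for illustration):

```sql
-- Spot the pile-up: many copies of the same query, all in the same state
SHOW FULL PROCESSLIST;

-- The offending query was filtering on an unindexed column,
-- so a simple secondary index resolves it (hypothetical names):
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);
```

On a small table this completes in seconds, which is exactly why the story ends with you back in bed at 2:15 am.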
Now, imagine that query was deployed to 1,000 servers against a table that takes 18 hours to ALTER. The CPU spike brings the entire fleet to a screeching halt. That is a scenario that may involve some executives being “less than pleased” with the situation while derailing your week/month/career.
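At that table size, a blocking ALTER is not an option. One common approach is an online schema change tool such as Percona Toolkit’s pt-online-schema-change, which builds the altered table in the background and swaps it in without locking writes. A sketch of what that could look like (database, table, and column names are hypothetical):

```
pt-online-schema-change \
  --alter "ADD INDEX idx_customer_id (customer_id)" \
  D=shop,t=orders \
  --execute
```

Even with an online tool, rolling a change like this across a fleet of 1,000 servers is an orchestration problem in its own right, which is exactly the kind of challenge listed above.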
In short – scale really does matter. True, you don’t want to spend 6 months over-optimizing a new MVP that you are trying to get to market first. But that doesn’t mean you shouldn’t start optimizing it sooner rather than later.
More posts from the TAM team on the way
This series of posts from our TAM team aims to highlight the potential impact of fundamental issues in large deployments. Other topics will cover the different types of challenges we see working with clients operating at web scale. Keep your eyes open for the first posts in this series over the coming weeks!