Infrastructure at Scale
Many operations teams are tasked with the competing goals of delivering reliable infrastructure quickly while also ensuring high uptime for every infrastructure component. By the time an organization has more than 50,000 hosts, its process for balancing these goals is very well defined, usually as the result of some very expensive mistakes made along the way. The purpose of this talk is to share best practices and lessons learned as we have scaled many organizations from tens to thousands of servers. We'll ask, and answer, questions like:
- How and what should we automate?
- What metrics matter?
- When does a change control process become important?
We'll also talk about the rule of threes, why user IDs are important, why time matters, and many other details that are often overlooked in the early phases of scaling.