Every day hundreds of millions of dollars are wasted by allowing improperly tuned or misconfigured systems, misunderstood infrastructure, and inefficient IT operations to live and thrive in data centers around the globe. There are both direct and indirect costs associated with allowing these unhealthy systems to continue to exist. Let’s look at some.
Let us start with a small example: a small database setup with a single master-slave pair and a database size of, let’s say, 500GB. Traffic is steady, and let’s say this translates into 500 IOPS on the master. You have chosen to host this on Amazon’s AWS. A common way of ensuring backups occur in AWS is to set up EBS snapshots of the slave. In terms of usage, let us assume your CPU is about 50% used and you have about 20GB of hot data that needs to stay in memory for the database.
If we look at what this would take to support in EC2, you are looking at roughly this:
- 2 c3.4xlarge servers (16 vCPUs, 30GB of memory)
- Master-slave set
- with 1TB of Provisioned IOPS SSD, over 2 volumes
- with 500 IOPS on the master, 125 IOPS on the slave
- estimated 7TB of storage for snapshots
Amazon’s pricing calculator gives us an estimated cost of $3,144.28 per month, or roughly $38,000 a year in hosting fees. Note that you can choose other tiers of service, or reserved or spot servers, to get different pricing.
Regular, steady growth:
Now let’s assume your database is growing along with its traffic at about 5% per month (these are rough numbers, I know). After a year your database server would be out of steam, using 86% CPU, holding 34GB of hot data (so relying more heavily on disk), and consuming just about 850GB of storage space. Moving up to the next tier of servers and adding IOPS, you will see your spend jump to around $4,771.32 per month ($57,000 per year).
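As a rough check on those growth figures, here is a minimal sketch of the compounding. The starting values come from the example above; the choice of 11 compounding steps (month 1 as the baseline, month 12 eleven steps later) is an assumption, made because it lines up with the quoted numbers.

```python
def project(value: float, monthly_growth: float, months: int) -> float:
    """Compound a starting value by a fixed monthly growth rate."""
    return value * (1 + monthly_growth) ** months


# Starting figures from the example. STEPS = 11 is an assumption:
# month 1 is the baseline, so month 12 is 11 growth steps later.
STEPS = 11
GROWTH = 0.05

cpu_pct = project(50, GROWTH, STEPS)       # CPU utilisation in percent
hot_data_gb = project(20, GROWTH, STEPS)   # working set that should stay in memory
db_size_gb = project(500, GROWTH, STEPS)   # on-disk database size

print(f"CPU: {cpu_pct:.0f}%  hot data: {hot_data_gb:.0f}GB  database: {db_size_gb:.0f}GB")
```

With 30GB of memory per server, the projected hot data no longer fits in RAM, which is why the example leans more heavily on disk by the end of the year.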
When tuning and auditing an environment like the one above, we have been able to give some customers a 50% or greater improvement in performance, and we often see a 20-25% reduction in space. Let’s be conservative and say we can get a 25% boost in your performance, reduce your 5% monthly growth to 4%, and shrink your database by 10%. Based on that, you can stave off upgrading your servers for an additional 9 months, saving you almost $15,000 in that first year alone. Over 4 years this customer would end up saving an estimated $75,000 in total AWS spend, just based on smaller data and performance enhancements.
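To see where a delay of that size can come from, here is a hedged back-of-the-envelope sketch. It rests on two assumptions that the text does not spell out: the upgrade is triggered at roughly 86% CPU, and a "25% boost" means the same work needs 1/1.25 of the CPU. Under a different reading the result shifts by a month or two.

```python
import math


def months_until(current: float, limit: float, monthly_growth: float) -> float:
    """Months of compound growth before `current` reaches `limit`."""
    return math.log(limit / current) / math.log(1 + monthly_growth)


CPU_LIMIT = 86.0  # assumed upgrade trigger, in percent CPU

# Untuned: 50% CPU today, growing 5% per month.
baseline = months_until(50.0, CPU_LIMIT, 0.05)

# Tuned (assumed reading of "25% boost"): same work on 1/1.25 of the CPU,
# with growth slowed to 4% per month.
tuned = months_until(50.0 / 1.25, CPU_LIMIT, 0.04)

print(f"untuned: {baseline:.1f} months, tuned: {tuned:.1f} months, "
      f"gained: {tuned - baseline:.1f} months")
```

Under these assumptions the gain comes out at a little over eight months, in the same ballpark as the estimate above.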
In this case, performance enhancements are not the only place to save costs. By moving from EBS snapshots to regular MySQL backups using Percona XtraBackup, keeping one copy on disk and sending those backups to S3, the cost of the environment drops to $2,043.87 per month (from $3,144.28). This means a simple switch of backup methodology can net you about $1,100 a month, or $13,200 a year, off your hosting bill.
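The arithmetic behind that saving is simple enough to sketch. The two monthly figures are the ones quoted above; the variable names are only illustrative.

```python
EBS_SNAPSHOT_MONTHLY = 3144.28   # quoted monthly cost with EBS snapshots
XTRABACKUP_S3_MONTHLY = 2043.87  # quoted monthly cost with XtraBackup plus S3

monthly_saving = EBS_SNAPSHOT_MONTHLY - XTRABACKUP_S3_MONTHLY
annual_saving = monthly_saving * 12
saving_pct = monthly_saving / EBS_SNAPSHOT_MONTHLY * 100

print(f"monthly: ${monthly_saving:,.2f}")
print(f"annual:  ${annual_saving:,.2f}")
print(f"share of original bill: {saving_pct:.0f}%")
```

Put another way, the backup change alone trims roughly a third off the monthly bill in this example.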
Often we are not only reducing the resources needed, but can also reduce the number of servers needed to run your application through tuning. We had a recent client who was able to see a 90% reduction in their read-heavy workload and actually turn off servers that used to be used to serve their application. Here is what their savings over the next couple of years would look like:
Here we helped cut this customer’s direct costs by two-thirds.
The one thing to keep in mind is that this assumes linear growth in application and database usage. This would mean you can predict when you will need hardware. If your user base is growing and feature additions are controlled, that is possible; in most environments, however, you will not see that linear growth. You will probably see something like this:
Understanding this pattern and the spikes is vital to keeping your costs down. See that giant spike up to 2,500? The first reaction for many is to upgrade their hardware, then tune. Inevitably, any tuning benefit is offset by the already sunk cost of a hardware upgrade which, after the tuning, they may not have needed. Getting in front of that spike and preventing it could have saved tens or even hundreds of thousands of dollars.
Spikes kill performance and cost real dollars. Those spikes may not be easy to find. A few years ago I was working with a Fortune 500 client who had one of these spikes. They had been running perfectly fine with steady but controllable growth for 7 or 8 months, then in the 9th month things went very wrong very quickly.
A critical component of their company was to certify professionals through a testing application. During their peak time of season these certifications had stopped completely – delaying certifications for thousands of employees and clients for 2 weeks. I was flown out to help control the bleeding and hopefully fix the problem.
The number of users using the application was the same, the number of page views on the web was steady, but the number of queries to the database skyrocketed. None of the queries had hit their threshold of 1 second to be flagged as problem queries. It turned out to be one query that took 250ms to execute that was causing this company to grind to a halt. That one query ended up being executed 25,000 times per page when certain conditions were met, and those conditions were not met until the 9th month after this application was re-released.
This query lay like a Trojan horse, waiting to destroy this company’s ability to deliver to its customers.
Two lessons can be learned from this. The first is that even a seemingly well-tuned system may not be. The second is that small things matter. In this case fixing the code was the correct solution; however, proper indexing of the tables dropped the query time from 250ms down to 50ms. This was enough of a relief to allow the certification process to start up again until the code could be fixed. Even a query with a seemingly small impact should still be optimized.
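One way to catch this kind of problem is to rank queries by total time consumed rather than by per-execution latency. The sketch below is illustrative only: the query names and the sample data are hypothetical, and the 1-second threshold is the one mentioned in the story.

```python
from collections import defaultdict

SLOW_THRESHOLD_S = 1.0  # per-execution threshold from the story


def total_time_by_query(samples):
    """Sum execution time per query fingerprint from (fingerprint, seconds) pairs."""
    totals = defaultdict(float)
    for fingerprint, seconds in samples:
        totals[fingerprint] += seconds
    return dict(totals)


# Hypothetical page load: one "cheap" 250ms query repeated 25,000 times,
# plus a single slower query that still stays under the threshold.
samples = [("lookup_by_user", 0.250)] * 25_000 + [("load_dashboard", 0.800)]

# A per-execution threshold flags nothing here.
flagged = {fp for fp, seconds in samples if seconds >= SLOW_THRESHOLD_S}

# Aggregating by total time makes the real culprit obvious.
totals = total_time_by_query(samples)
worst = max(totals, key=totals.get)

print("flagged by threshold:", flagged)
print("worst by total time:", worst, f"{totals[worst]:,.0f}s per page")
print("same query at 50ms:", f"{25_000 * 0.050:,.0f}s per page")
```

If the executions run one after another, 25,000 runs at 250ms add up to 6,250 seconds of query time per page, and still 1,250 seconds at 50ms. That is why indexing bought relief but the code fix was the real solution.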
Another source of these performance spikes is a company’s application release cycle. Applications are very much living entities in today’s world. They grow, expand, and change on a regular basis. In order to stay ahead of any problems you need to have a process and resources in place that can proactively monitor and tune. Every release of new code should go through a rigorous performance review to prevent Trojan horse queries that may cause problems and extra costs down the road.
Indirect and hard-to-calculate costs:
All of this discussion so far has been around direct hosting costs. There is also a cost to your reputation and your ability to deliver services that meet a customer’s expectations. Customers who come to your site or are using your application can leave in droves due to poor performance. We have seen several customers who lost 50% or more of their user base due to performance problems with the application.
Lost revenue and profits are often much more difficult to quantify, and vary greatly from company to company. This cost, however, is very real. Silicon Valley is littered with the remnants of companies that did not plan to address scale or simply missed important problems in their IT infrastructure. Unfortunately I have worked directly with numerous companies that learned this lesson the hard way. These hidden costs can kill a customer quicker than any competitor or market shift.
One of the biggest hidden costs companies needlessly pay is the cost of downtime.
The cost of downtime:
I was reading a Gartner study that estimated the cost per minute of downtime at $5,600; other studies have pegged the cost per minute of downtime at $7,900.
Any way you slice it, being down for even a minute costs you money. If we are conservative in our estimates, the cost of an hour of downtime can easily top $100,000. It’s amazing how many well-established companies don’t have a solid plan for dealing with downtime.
Let’s look at some common disaster recovery policies:
Restore from backup:
How quickly can your DBAs get alerted to an outage, then log in to look at the outage, and finally make a call on whether or not to restore? I submit that most people are going to take a few minutes to get an alert (let’s say 2). They will then take a few minutes to get to the computer and into the system (let’s say 5 minutes). Then they will take at least 10 minutes to try to figure out what’s going on. Fast-forward, and a minimum of 17 minutes has gone by with nothing to show for it.
Restoring the backup itself could take a few minutes or several hours. Let’s just say 40 minutes total. If we use that $7,900 number, you could have just lost $316,000. That’s a huge amount that could have easily been avoided. Maybe you know that you’re not losing $7,900 a minute, maybe it’s only $1,000. That’s still $40,000!
Manual failover to a slave:
The time for getting an alert, reacting, and taking action does not change in this equation. The original 17 minutes (minimum) to react and start fixing has potentially just cost you $134,300.
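The downtime figures in these scenarios follow directly from minutes multiplied by cost per minute. Here is a small sketch using the numbers quoted in the text; the function and variable names are only illustrative.

```python
COST_PER_MINUTE = 7_900  # higher of the two per-minute estimates quoted above

# Reaction time before any fix starts, in minutes, from the restore scenario.
reaction = {"get alerted": 2, "log in": 5, "diagnose": 10}
reaction_minutes = sum(reaction.values())


def downtime_cost(minutes: int, cost_per_minute: int = COST_PER_MINUTE) -> int:
    """Estimated cost of an outage lasting `minutes`."""
    return minutes * cost_per_minute


print(f"reaction time: {reaction_minutes} minutes")
print(f"reaction alone: ${downtime_cost(reaction_minutes):,}")
print(f"40-minute restore: ${downtime_cost(40):,}")
print(f"40 minutes at $1,000/minute: ${downtime_cost(40, 1_000):,}")
```

Swapping in your own per-minute figure and recovery times is usually the quickest way to compare recovery strategies against what they cost to put in place.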
Automated failover:
Not all automated failovers are created equal. Some solutions can take anywhere from several minutes to hours to restore proper service (for example, the warm-up time of passive, cold slaves). Just because you think you are protected does not mean you are. Having the right automated solution can mean you minimize your downtime risk to $10K or less; having the wrong one can be worse than having none at all.
It’s important to understand the cost of downtime and pick the proper solution to mitigate it.
Cost of being wrong is high:
These are just a few of the costs that companies can incur by having an improper database and infrastructure setup. Mitigating these costs requires a solid process, a high level of expertise, and the right resources in place.