If you have been around MySQL for any length of time, you are probably aware that choosing the correct data types and optimizing your schema are actually important tasks. A few years back at Percona Live 2016, I gave an introductory talk on schema review and optimization. Here’s the video:
I was thinking about that talk in the context of some of my current clients. Though I had worked on extremely large database deployments during my earlier tenure at Percona, it was often more of an outlier. Now, working as a Technical Account Manager with our largest clients, it is much more common.
The Fundamental Problem
I’d like to expand my thoughts on the “choosing the smallest data type you can” principle from my 2016 slides through the lens of a few of my 2019 clients. I gave an example of two copies of the same table (a simple 4 column, 3 index table with ~4 million rows), one using a bigint for the primary key and one using a regular unsigned int for the primary key:
[root@sample-host plive_2016]# ls -alh action*ibd
-rw-rw---- 1 mysql mysql 908M Apr 7 16:22 action_bigint.ibd
-rw-rw---- 1 mysql mysql 636M Apr 7 16:23 action.ibd
In this example, there was almost a 30% space savings associated with using the smaller data type. Obviously, at the scale of 1GB of space, this is trivial. One comment I made during the talk references the adage “storage is cheap”. While this can be true at a certain scale, I’m seeing this thinking break down more frequently with my largest clients.
The Problem Magnified at Scale
As an illustration, one of my clients is running roughly 10,000 Percona Server instances in their sharded deployment. These servers are running on bare metal with above-average hardware (for performance concerns). While that sounds like a lot of servers (and it definitely is), you also have to take into consideration other operational concerns such as backups. For the sake of some easier math, let’s assume the following:
- 5 servers/shard
- 500G data directory
- 5 backups of each shard (various time ranges such as daily, weekly, monthly)
Using those numbers as an estimate, one would be looking at roughly the following for space:
- ~4.7 petabytes storage for running instances (SSD)
- ~6 petabytes storage for backups (HDD)
The Business Impact on Large Deployments
Suddenly, at that scale, the 30% space that seemed trivial in my example seems a bit more important. Let’s run some numbers based on the current $/TB pricing of SSD and spinning HDD with the 30% reduction in space:
- SSD Savings ~ $140,000
- $100/TB Cost
- 30% of 4.7PB = 1.4PB
- HDD Savings ~ $46,000
- $25/TB Cost
- 30% of 6PB = 1.8PB
Saving 3 petabytes of storage would reduce the raw storage costs by nearly $200,000. On top of the actual capital costs of 3PB of storage, you have to consider server count, power, and maintenance (among others) which would increase this cost significantly. Clearly, this is just a theoretical example of the potential impact a small mistake like picking the wrong data type can have on the overall infrastructure cost at scale. Generally, by the time a company reaches this scale, these sorts of optimizations have already been made and we need to start looking deeper and more creatively at other alternatives.
While this is definitely an oversimplification of calculating storage costs based on raw hardware prices, it does beg the question: even though it may not seem important now, why not ensure your schema is optimized before it becomes an expensive problem to fix!