In this blog post, I’ll discuss MySQL sharding models, and how they apply to SaaS application environments.
MySQL is one of the most popular database technologies used to build many modern SaaS applications, ranging from simple productivity tools to business-critical applications for the financial and healthcare industries.
Pretty much any large scale SaaS application powered by MySQL uses sharding to scale. In this blog post, we will discuss sharding choices as they apply to these kinds of applications.
In MySQL, unlike in some more modern technologies such as MongoDB, there is no standard sharding implementation that the vast majority of applications use. In fact, if anything “no standard” is the standard. The common practice is to roll your own sharding framework, as famous MySQL deployments such as Facebook and Twitter have done. MySQL Cluster – the MySQL software that has built-in Automatic Sharding functionality – is rarely deployed (for a variety of reasons). MySQL Fabric, which has been the official sharding framework, has no traction either.
When sharding today, you have a choice of rolling your own system from scratch, using comprehensive sharding platform such as Vitess or using a proxy solution to assist you with sharding. For proxy solutions, MySQL Router is the official solution. But in reality, third party solutions such as open source ProxySQL, commercial ScaleArc and semi-commercial (BSL) MariaDB MaxScale are widely used. Keep in mind, however, that traffic routing is only one of the problems that exist in large scale sharding implementations.
Beneath all these “front end” choices for sharding on the application database connection framework or database proxy, there are some lower level decisions that you’ve got to make. Namely, around how your data is going to be led out and organized on the MySQL nodes.
When it comes to SaaS applications, at least one answer is simple. It typically makes sense to shard your data by “customer” or “organization” using some sort of mapping tables. In the vast majority of cases, single node (or replicated cluster) should be powerful enough to handle all the data and load coming from each customer.
The next set questions you should ask yourself are around your SaaS applications:
I address the answers in the sections below.
How much revenue per customer you’re generating is an important number. It defines how much infrastructure costs per customer you can afford. In the case of “freemium” models, and customers generating less than $1 a month an average, you might need to ensure low overhead per customer (even if you have to compromise on customer isolation).
How much revenue per customer you’re generating is an important number. It defines how much infrastructure costs per customer you can afford. In the case of “freemium” models, and customers generating less than $1 a month an average, you might need to ensure low overhead per customer (even if you have to compromise on customer isolation).
Typically with low revenue customers, you have to co-locate the data inside the same MySQL instance (potentially even same tables). In the case of high revenue customers, isolation in separate MySQL instances (or even containers or virtualized OS instances) might be possible.
Isolation is another important area of consideration. Some enterprise customers might require that their data is physically separate from others. There could also be government regulations in play that require customer data to be stored in a specific physical location. If this is the case, you’re looking at completely dedicated customer environments. Or at the very least, separate database instances (which come with additional costs).
Customer size and requirements are also important. A system designed to handle all customers of approximately the same scale (for example, personal accounting) is going to be different than if you are in the business of blog hosting. Some blogs might be 10,000 times more popular than the average.
Finally, there is a there is the big question of whether all your customers are running the same database schema and same software version. If you want to support different software versions (if your customers require a negotiated maintenance window for software upgrades, for example) or different database schemas (if the schema is dependent on the custom functionality and modules customers might use, for example), keeping such customers in different MySQL schemas make sense.
This gets us to the following sharding isolation models, ranging from lowest to highest:
The farther you go down this route – from the shared schema to an environment per customer – the more important is to have a high level of automation. With a shared schema, you often can get by with little automation (and some environments manually set up) and all the schema’s pre-created. If customer sign up requires setting up dedicated database instance or the whole environment, manual implementation doesn’t scale. For this type of setup, you need state-of-the-art automation and orchestration.
I hope this helps you to understand your options for MySQL sharding models. Each of the different sharding models for SaaS applications powered by MySQL have benefits and drawbacks. As you can see, many of these approaches require you to work with a large number of tables in the MySQL – this will be the topic of one of my next posts!
Nicely articulated article. There are few use cases where data segregation can be a challenge. The above use cases described answer most of the challenges in Saas Product for Enterprise clients (B2B). However in a B2C use case, segregation by customer or instance per customer will not be feasible. I have tried with Hashing algorithms – for example -hash calculation owned at the application server level , combined with data router (set of tables that has information of shards for given data ranges), but these tailor made solutions pose maintenance problems and get complex over a period of time. Is there a linear way to scale writes for this use case? Time based shards is another option, but do you think there are better options available for us? Thanks for your time!
Lots of great points here, highlighting the effort required to shard- ie, gain write-scale for MySQL deployments.
Also great point re. the ubiquity of the need, yet the paucity of widely-used, OOTB sharding solutions. DBAs/DevOps are thus unfortunately in the position of having to ‘roll-their-own’ sharding topologies, which adds even more complexity to their deployments.
An additional important consideration is the amount of application changes required to enable sharding- as data is partitioned across shards, cross-shard transactions don’t enjoy ACID guarantees by the RDBMS anymore; typically the trade-off is between Availability -vs- Consistency. Which is usually fine for reporting and analytics, but can be terrible for E-commerce and Gaming. There are MySQL-compatible scale-out databases providing the write-scale of sharding, without any of the above headaches.
Hi Peter, that’s a great introduction post on the subject and thanks for that but could you add some links to related posts or books to go further in the understanding of the process especially for the schema by customers and the shared schema?