November 21, 2014

Why you don’t want to shard.

Note: This blog post is part 1 of 4 on building our training workshop.

The Percona training workshop will not cover sharding. If you follow our blog, you’ll notice we don’t talk much about the subject; in some cases it makes sense, but in many we’ve seen that it causes architectures to be prematurely complicated.

So let me state it: You don’t want to shard.

Optimize everything else first, and then if performance still isn’t good enough, it’s time to take a very bitter medicine. The reason you need to shard basically comes down to one of these two reasons:

  1. Very large working set – The amount of memory you require to keep your frequently accessed data loaded exceeds what you can (economically) fit in a commodity machine. 5 years ago this was 4GB, today it is 128GB or even 256GB.  Defining “working set” is always an interesting concept here, since with good schema and indexing it normally doesn’t need to be the same size as your entire database.
  2. Too many writes – Either the IO system, or a slave can’t keep up with the amount of writes being sent to the server.  While the IO system can be improved with a RAID 10 controller w/battery backed write cache, the slave delay problem is actually very hard to solve. Maatkit has a partial-solution (via Paul Tuckfield), but it doesn’t work for all workloads.

(Yes, I am simplifying some of the scalability issues with MySQL on big machines, but I have faith that Yasufumi is making this better).

What types of Sharding are there?

Despite my cautions, if you have established that you need to shard there are quite a few options available to you:

  1. Sharding Partitioning by Application Function – This is usually the best way to fix any of the problems mentioned above. What you do is pick a few very busy tables, and move them onto their own MySQL server.  Partition-by-function keeps the architecture still simple, and should work for most cases unless you have a single table which by itself can’t fit into the above constraints.
  2. Sharding by hash or key – This method works by picking a column on a table and try and divide up your data based on it.  You can choose any column to hash on, you just need to make sure that it will equally distribute the data equally. In practice this method can be really hard to get working right, since even if each shard has the same amount of ‘customers’, demanding users tend to by far exceed average users and some servers are overloaded while others are not.

    (Tip: There are a few famous cases of both (a) bad hashing algorithms and (b) users becoming unequal all of the sudden;  You don’t want to shard based on the first character of a username – as there will be a lot more ‘M’ than ‘Z’.  For users becoming unequal all of the sudden, it’s always interesting to think of what scaling challenges Flickr would have had for the official Obama photographer in the lead up to the 08 election.)

  3. Sharding via a Lookup Service - This method works by having some sort of directory service which you query first to ask “what shard number will this users data exist on?”.  It’s a highly scalable architecture, and once you write scripts to be able to migrate users to/from shards you can tweak and rebalanced to make sure that all your hardware is utilized efficiently.  The only problem with this method is what I stated at the start: it’s complicated.

(Note: I’ve left out some of the more complicated sharding architectures.  For example; another solution is to have shards all store fragments of data, and to cross backup those fragments across shards.)

Why is it so complex?

The reason it’s complex comes down to two reasons:

  1. The application developer has to write more code to be able to handle sharding logic (this is actually lessened with projects such as HiveDB.)
  2. Operational issues become more difficult (backing up, adding indexes, changing schema).

I think that a lot of people remember (1), but (2) can be a real pain point.  It can take a lot of work to build an application that works correctly when you are rolling through an upgrade where the schema will not be the same on all nodes.  A lot of these tasks remain only semi-automated, so from an operations perspective there can often be a lot more work to be done.

This concludes Part 1 – I hope I’ve justified why we are not covering sharding.  In Part 2, I will write about something that is going to be in the course – “XtraDB: The top 10 enhancements”, and in Part 3 “XtraDB: The top 10 parameters”.

About Morgan Tocker

Morgan is a former Percona employee.
He was the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

Comments

  1. sudhir says:

    “(Tip: There are a few famous cases of both (a) bad hashing algorithms and (b) users becoming unequal all of the sudden; You don’t want to shard based on the first character of a username – as there will be a lot more ‘M’ than ‘Z’. For users becoming unequal all of the sudden, it’s always interesting to think of what scaling challenges Flickr would have had for the official Obama photographer in the lead up to the 08 election.)” – Probably I am reading an old article -
    can sharding on reverse of username work. It should be more random?

  2. I couldn’t agree more. Sharding is complicated and is highly likely to introduce bugs into your application. These bugs will be subtle, edge-cases, for example, where a customer has moved to a different server and then want to report on data which cross the boundary when they moved. These are going to be difficult bugs to find and very hard to fix.

  3. morgan says:

    @Mark – I didn’t mention bugs, but I’m with you on this one. With sharding (or complicated replication setups) having an application error/disaster mangling your data is not fun. Trying to sift through and re-piece it together is like pulling teeth.

  4. Ori Lahav says:

    thanks Morgan for the great post.
    I would like to hear your opinion on MySql Cluster – isn’t that the way to do sharding?

  5. Morgan Tocker says:

    @Ori – thank you.

    In response to MySQL Cluster, I throw this into a different category of sharding (although from a technical perspective it is method 2: sharding by hash or key). But my reason why it is different is that by managing the sharding internally some of these pains have been taken out for you.

    Having said that: I’d still consider “migrating to MySQL Cluster” as a similar decision to “migrating to another database” – there are a lot of workloads it isn’t that well suited for, and it can be difficult to simply retrofit into an existing application. I’d like to see join performance (via hash joins) improved for example, but there are a few more issues than that.

  6. Brooks Johnson says:

    I agree with the overall statement that sharding is often overkill, I’m not as fond of partitioning by application as it can often lead to complex issues. For example, if you need to join two large tables that are on separate machines one ends up writing the join in the application server, which is far more complex than a simple sql statement. Effectively, one is no longer using the database to full potential and it isn’t that much different from a key/value store. If what you effectively have is a key/value store, why not use a key/value store optimized data store (one with range scans)?

    Alternatively, with sharding, one can still join various tables other if the sharding was done by a good key. Sharding keeps much of the coding flexibility, but as pointed out, with a heavy, often too heavy, operational cost. I’m not saying that application partitioning doesn’t make sense in many or even most cases, but that sharding can make more sense in a relationship rich application where the joining of those relationships is important.

  7. Morgan Tocker says:

    @Brooks – It would depend on the application if it’s possible – but you’re right, even with Partition by function you loose some functionality (although less than the other sharding types). Many applications have different requirements that allow you to separate out this functionality reasonably elegantly;

    – Logging and searching are two problems that are normally pretty expensive, and a lot of applications have to solve them.
    – If we look at something like this wordpress install, the comments could reasonably safely be moved to another server/group servers.
    – If we look at something like Facebook/Flickr/Livejournal the friendship system is usually expensive enough that it can be moved too.

    I think that you can strike a good middle ground before having to throw out the relational model completely.

  8. Great article. We are indeed working on a large working set, well beyond 256G, and I can relate to the issues you listed.

    Interesting point you made about the first character of usernames… I have a question — what about creating a seperate table for each user?

  9. Morgan Tocker says:

    @David: That would depend on the number of users ;)

    It’s a common question with no answer that is correct 100% of the time.

    If there are too many, you end up with a lot of small files. In MySQL-terms that means at least 1 file per table (.frm), and a lot of pressure on your filesystem to have that many in one directory (a good rule of thumb is probably aim for less than 1000 per directory, but each filesystem implementation is different). There’s also a lot of problems with opening tables scalability in MySQL. See http://www.mysqlperformanceblog.com/2006/11/21/opening-tables-scalability/ and http://www.mysqlperformanceblog.com/2009/06/16/slow-drop-table/ for example.

  10. Morgan Tocker says:

    @David: I think I hit ‘submit’ too quickly:

    Another problem that you need to be careful of, is how much memory InnoDB allocates for data dictionary (it’s unlimited in size!). We have a patch that addresses it here:
    http://www.percona.com/docs/wiki/patches:innodb_dict_size_limit

  11. Peter van Dijk says:

    For those of us dealing with enormous working sets, there really isnt any other option than to shard, but I dont think sharding needs to be as doom and gloom as you point out.

    I think that “complexity” as you mention it, is probably too broad of a term to use in reference to the structure of sharding systems.
    Yes, they are complex, but that’s not necessarily a bad thing. Mysql is complex. that doesnt make it bad.
    The real danger of implementing sharding is bad design. Bad design is ‘bad’ complexity.

    In relation to your article, there’s a number of things which i wanted to mention, in no parcticular order:

    – I disagree with the assumption that putting your entire database on a single server is a good idea, for the exact same reason that, even though you could have a single web server with 1000 xeon’s in it and a TB of memory, doesnt mean that it’s a good idea. There are a lot of people out there that cant see out from the confines of their database infrastructure. I’d much rather have 4 webservers than 1. Same applies to database infrastructure.

    – Complexity (read: difficulty of use) for application developers depends on the application and framework design, not the fact that it’s sharded. If you design your sharding infrastructure with simplicity for developers as a primary goal, it’s possible to implement frameworks where the developer only needs to have a rudimentary understanding of the database infrastructure (even though a full understanding of how things work is desired. Still, our new developers are able to use the system straight out of the induction process). It’s just as easy to have a bad unsharded framework as it is a bad sharded framework.

    – Backups arent all that scary, just that you have to do them from more than one server. If you’re considering sharding, you may want to ensure that you have good automation for these sorts of things well in advance, not to mention things like good monitoring systems.

    – Schema changes and other migrates are not always that much of a problem. If your entire development team is on the mysql command line making live schema changes, yes, you’re going to have issues. We found that the best approach is to treat database changes as a part of the application. Our developers write code to change the database, not queries. This gives error handling, logging, full historys of everything that gets changed, and a number of other benefits. In terms of shards, it allows us to keep all of the shards consistant in a much easier and managable way.

    – In organisations of varying skill level, it’s important to have development processes in place to cope with the fact that not everyone is going to understand everything that happens at the database level. This can potentially become more important once sharding is implemented.

    – Some days i wish i had a bigger budget. There’s some really cool things you can do in terms of sharding solutions that you could implement at the mysql or storage engine level, but they’re a bit to complex to discuss here :)

    In summary, I think what I’m trying to say is that:
    – Like you said, most people shouldnt be sharding. It’s really only a solution for the high end. Make sure you’re capable of fully understanding your entire architecture.
    – Complexity isnt evil, but bad design is. If you do have to shard, 90% of your work should be in the design, not the implementation, otherwise you’re going to end up with all those problems you mentioned in the article.
    – There are also some other benefits to sharding which you dont list here. Shards can be an amazing tool when dealing with clusters in different geographical locations.

  12. morgan says:

    @Peter: It’s good to have disagreements – it sparks discussion!

    There’s a difference between saying “you don’t want to shard” and “you shouldn’t shard”. I think my wording is very careful here, and the point is that there will be some situations where you still have to. There are people with working sets larger than 256G (including yourself), but a large number of people will also have less than this number.

    Having your data in one place does not necessarily mean having no redundancy. In my point about “too many writes” I explicitly mentioned slaves, which means I still want replication. Where you said “I’d much rather have 4 webservers than 1. Same applies to database infrastructure.”, this is something I disagree on. Yes – I’d rather have more webservers, but that is because webservers have no affinity. Replication is like RAID1, Sharding* is like RAID0. If you have x10 as many shards, then you actually have a much higher chance of failure.

    * In the examples I provided, I would typically use replication as well.

  13. Peter van Dijk says:

    Yeah, i wasnt disagreeing with you on that first one at all. I definately agree that most people ‘shouldnt’, ‘dont want to’ and ‘shouldnt want to’ shard. same thing. different words. That being said, i’m still a great proponent of their use where it’s warranted.

    RE: “In the examples I provided, I would typically use replication as well.”, and “Having your data in one place does not necessarily mean having no redundancy.”,

    I couldnt agree more, but this isnt specific to either centralised database or to shards.

    Having 4 database servers or shards shouldnt mean no redundancy. In an ideal world, i’d typically assign a hot spare that can take the role of an existing shard in the case of a failure. Again, this is a design level problem. Using sharding as a tool doesnt prohibit you from designing a redundant, fault tollerant infrastructure.

    I think your comments on failure rates are a little too simplified and dont really show the entire problem. Even though it increases the chance of a failure, it also lessens the impact by an equal amount. For example, in the case were you do have 10 shards, if one fails, only 10% of your customers would be affected at all. If you factor replication and/or failover to a hot spare on top of this, it actually looks more like RAID10 rather than RAID0.

    Of course, this can add complexity, but it doesnt need to be badly designed. You can automate these kinds of processes. It’s like anything else though: how much do you want to spend, and how much time do you want to spend doing it. Horses for courses.

    Those points aside,
    I think what it really comes down to, is that once you start doing the scalability research for your application, you’ll find that there are any number of different approaches that you can take. Sharding may not be appropriate for any of the points that either of us have made,
    but I like it because it gives me the opportunity to do many things that a centralised database infrastructure does not – such as partitioning shards geographically while still maintaining centralised management, let alone things like performance and fault tollerance, possible increases in data integrity, and loads of other stuff i cant remember :)

  14. Andy says:

    Morgan,

    You mentioned HiveDb in your post. Is HiveDB still alive? They made some announcements a long time ago and there has been no activity on their site for over a year. No code released.

    So HiveDB seems pretty dead. So does Hibernate Shard.

    What are some good frameworks that support sharding?

  15. morgan says:

    @Andy – I included HiveDB as an example, since I’ve known people have used it in the past… but over 1 year of inactivity could be death in open source terms. I’m not sure of any other good examples – most of the stock frameworks are actually pretty bad at supporting read/write splitting let alone sharding.

  16. TS says:

    @morgan:

    I agree that applications should avoid sharding for as long as possible.

    Recent development in the MySQL and File Systems point toward migration to ZFS hybrid storage pool for accelerating synchronous writes and L2ARC read accleration.

    Smugmug already migrated to ZFS hybrid storage pool, although I disagree with the NFS methodology. Since NFS uses 10Gbit switches and adapters, which are about 500 dollars a port and you need dual adapters and redundant switching…that’s 8 ports total to connect two machines with redundancy = $4000 total. That is far more expensive than dual path SAS HBAs for DAS JBODs. (You needed DAS JBODs on the ZFS NFS server too).

    I dream of one day you can run MySQL with perfect scalability with this machine setup:
    http://www.flickr.com/photos/gavinmroy/2871736448/

    That’s the standard DB config that PostgreSQL can scale to. 32 cores with 100 spindles. If we now use SSDs in the caching tier on top of it with X25-M/Es, then you can buy 500K IOPs easily.

  17. Dave says:

    I’ve not seen much mention of partitioning using MySQL 5.1 and above – how does that relate to the sharding issue, and is it a good intermediate step to work with before having to restructure applications to handle “proper” sharding across multiple data servers? Especially if you have budget constraints! :D

  18. morgan says:

    @Dave – partitioning *may* help buy you some time. Lets look at the two reasons that make us shard:

    1. Very large working set:
    – Partitioning may change some execution plans so that less indexes need to be loaded in memory. How true this is would depend on the queries.

    2. Too many writes:
    – There’s a chance partitioning might make some index updates cause a little less random IO, but this hopefully is already negated by the InnoDB insert buffer feature.

    There are some annoying limitations which make MySQL Partitioning nothing like in other databases (Oracle), but having said that I’ve also managed to find a few suitable use cases every now and then as well. Maybe there’s room for a followup blog post some time.

  19. Dave says:

    @Morgan,
    Thanks for the info – our working set is quite large, with a few tables hitting 150Gb (including indexes), and there are frequent data replenishes which delete a large chunk of information and then replace it with updated information. In addition, queries against this information are frequent and very varied!

    Unfortunately, InnoDb is a non starter here due to budget (and therefore storage) constraints, so we’re stuck with MyISAM for the time being.

    Roll-on a nice conclusive blog post re. partitioning (including advancements in MySQL 5.4), InnoDb, Memory usage, storage requirements, etc that I can take to the powers that be and get them to hand over some more cash! LoL!

  20. Dave, I honestly doubt that partitioning is the miracle solution for you. It is no silver bullet.

    I think if the powers that be understood the real cost of MyISAM, that equation would flip on its head.

    Maybe you can think about archiving.

  21. Morgan Tocker says:

    @Baron – thanks for being a little less politically correct than me ;) I re-read my comment and realized I said “it may help”, but I really meant “It may help, but I don’t think it will fix it”.

    @Dave – When Baron was talking about archiving, he was probably implying mk-archiver – http://www.maatkit.org/doc/mk-archiver.html. It’s an excellent tool if you find you don’t need older records.

  22. Dave says:

    @Baron and @Morgan,
    Thanks again for the info, we actually archive our data yearly, and only store about 4 rolling years of data – unfortunately, most (if not all) of it is in regular use…

    I have got some interesting numbers back from my analysis though;

    My main area of concern was the replacement of data with refreshed information. This involved DELETEing large chunks and INSERTing new data. There is a single field that is primarily used when doing the DELETE, which contains roughly 200 unique entries with a fairly even distribution (eg DELETE FROM table_name WHERE field_name = 5).

    After partitioning the data into 10 chunks based on the HASH of this field, we have experienced a four-fold improvement in the DELETE command, which is now able to DELETE 8.2m records from a 1.5b record table faster than a similar DELETE of 3.6m records from a 500m record table.

    Maybe our application just happens to be one of those lucky ones that partitioning _may_ help!

    I have yet to get some proper performance numbers on the SELECT side of the data, but I shouldn’t imagine it will be any slower than the current layout, and the data replace section is such a key one, I’m prepared to forego a small performance hit for the increases I’ve already mentioned.

  23. peter says:

    Brooks Johnson,

    The functional partitioning makes sense under 2 conditions

    1) The functional partitions are independent enough, hence you do not need to join data frequently between them at all. Putting different tables on different hosts is not the idea, putting “Forum” on one database host “Wiki” on another and “Bug System” on the third is.

    2) The gain you’re looking for is relatively small. It is often easy to find 3 independent functions with one of them responsible for 50% of the load (and hence split giving you double capacity) but getting 10x this way is rarely possible

  24. peter says:

    Peter,

    We’re not against sharding. In fact we help a lot of people how to shard properly. The problem is now it is such a buzz word so people with 1GB data set start sharing even if it is never going to grow over 10GB

    The bad design is one issue the other however simply working with sharded data. Really it is a lot depended on how tightly coupled is your data – for example hosting million of separate blogs is very easy to shard because there is no interdependencies.

    The large data also indeed causes operational concerns – databases in TB range are often not fun in MySQL due to challenges with backups and expecially things as alter table. http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/

    With backups – the concern is cross box consistency. With single box you can restored backup from yesterday and it will be consistent (even though not up to date) – in sharded envinronment backups will correspond to different point in time and so would not be consistent.

  25. Log Buffer says:

    “On the MySQL Performance Blog, Morgan Tocker explains why you don’t want to shard. (It has nothing to do with The Dark Crystal, I already checked.) [...]”

    Log Buffer #158

  26. I do agree you should only use it if you need to do realtime queries that are user facing across a very large dataset (10’s of TBs)

    Sharding is super easy if you know what you’re doing. 1,2 are not an issue at all for me. I can isolate all traffic for super powers users to an in memory DB which will not be overrun if done correctly.

  27. Anthony Berglas says:

    MASTER/SLAVE REPLICATION

    You forgot to mention that if there are many more reads than writes (common case) then running slave, read only databases off a master provides scalability without having to resort to sharding.

    Also, Sharding only works if the shards are largely independent, eg. GMail user accounts. But sharding and an integrated system such as ERP is likely to slow it down as the shards need to communicate.

    Some databases (Oracle) can horizontally partition a table (and I hope thus a database) automatically based on key values. That is the right approach. Keep the logical/physcial separation. Google style sharding and big table are a hack that ignores 50 years of database theory.

    You also forgot to mention that if you take a couple of big tables out of a database, you loose locking and transactions. Not a good option. (Oh, I forgot, MySql does not have locking anyway ;).)

  28. Peter van Dijk says:

    @Anthony,
    I think it might be helpful to consider that sharding can be used as another level of abstraction in a complex system, specifically, (and obviously this is a fairly gross oversimplification, but probably still valid)

    Where a raw disk has a filesystem placed on top of it to aid in organisation of the underlying data,
    a database server typically will use table structures on top of a filesystem to further abstract the low level operations of storing information in files into something that can be searched, modified and more easily maintained in a structured form.

    Similarly, shards, when implemented in a useful way, are able to abstract a given system in such a way that you’re able to distribute storage across an arbitrary number of machines. In our case, we have shards in different physical locations, where things like replication are completely impractical.

    By extension, the reason that sharding isnt really a good idea for most people is the same reason that, for example, if you want to copy your holiday photos onto a usb thumb drive, you dont use a database to do it. In many cases, that extra level of abstraction is completely useless and simply adds complexity.

    There are a lot of people who have spent a lot of time researching this area, and, particularly in the web world, it is an invaluable tool for dealing with enormous data sets. I think the notion that it’s ‘hack that ignores 50 years of database theory’ probably just indicates the need for better education and understanding of how it can be used as a tool.

  29. Morgan Tocker says:

    @ Anthony – I wrote under bullet point 2, that sharding was often a response to being write heavy (“too many writes”). I didn’t forget about replication, this article just has a specific purpose ;)

    Most applications can be broken down into shards (see my comment #6 for examples), but I don’t dispute this can be difficult in others. The example I often give for an application that won’t shard is IMDB’s database. I don’t think there are many good ways to divide actors up, and the movies they star in.

    A small correction to your point about locking: Readers don’t block writers in InnoDB because of MVCC, but MySQL does have locking. Related to your point though is cross-box consistency, and it is an issue. Peter wrote about this in comment #23.

  30. Hello,

    I am also working with massive amount of data and I love sharding for a lot of reasons.
    The main one is that even though it complexify the operations, it gives you a lot more flexibility and scalability.

    I have worked with traditional approach and sharding approach. Right now, I work with a traditional company that has so much money that they would rather buff the hardware as crazy as possible to fit everything in a few boxes.
    I also work with less traditional and they prefered having a lot of boxes to serve shards.

    When I was called to the rescue to the non traditional company, it was primarily because of the complex operations: how to publish new objects to all the shards (keeping versioning as well) and how to publish or replicate scripts as well. True! it was a challenge but it was fun to find the solution.
    The second challenge was about datawarehousing transactional data to purge data on the servers.
    Bottom line, we had very few indexes and had pretty much servers about 16GB of data, representing each one shard while we were server billions of transactions per month!
    The more traditional management (the datawarehouse) was then the most difficult administration part (with defragmentation/archiving/partitioning, and getting datamart and datamining part.
    based on my experience, I definitely prefer sharding which gives you a more intuitive solution for high availability and high performing having active/active nodes… oops shards (;-)).
    Well you got my point ;-) It competes seriously peer-to-peer replication, mirroring, log shipping and clustering!

  31. Joe Dugan says:

    This is a good article, and gives some very good points on why not to shard. But there are many reasons to shard as said in the earlier posts. I have seen some dramatic performance increases on mySql and Postgresql. There is a very good company called dbshards.com that has some very impressive benchmarks. I would recommend people look at all the options before they decide not to shard.

    I also read their article on reliable replication. It seems much better than the standard mySql replication.

    Sharding often solves scalability issues without much headache.

  32. Glenn says:

    You forgot a major use case: locality. Having the backend server in the USA is bad for users in Italy, yet you often don’t want completely distinct backends–users in one place should be able to talk to users elsewhere, and users should be able to be transparently migrated if their locale changes.

  33. James says:

    Locality: Glen this use case is fine if your application is not interested in data in the other countries. If you can seperate the data in your application to geographical regions then you will obtain a performance gain, but with a lot of applications this is not possible. Users in England are interested in the data in Italy, so sharding your data based on country will not help as you will have many cross domain/shard queries. Mongo does not provide any locality functionality to meet the use case you describne, that I know of. Does anyone know if mongo solves the issue of locality?

  34. mahdi lashkari says:

    HI, I have a question, I have a very big table and I want to shard it into five tables, but I only have one host. I mean I only want to split my table into five table and split my data according to userId key. Does it improve database performance for INSERT and UPDATE command? I know that if I spread my data on multiple host, performance would improve but what in my case?
    Thanks a lot

  35. Michael says:

    NuoDB CTO’s video on (No) Sharding is very worthwhile to watch. You can find it at http://vimeo.com/68592778

Speak Your Mind

*