October 20, 2014

“Shard early, shard often”

I wrote a post a while back that said why you don’t want to shard.  In that post I tried to explain that hardware advances, such as 128G of RAM being so cheap, are changing the point at which you need to shard, and that the (often omitted) operational issues created by sharding can be painful.

What I didn’t mention was this: if you’ve established that you will eventually need to shard, is it better to just get it out of the way early?  My answer is almost always no. That is to say, I disagree with a statement I’ve been hearing recently: “shard early, shard often”.  Here’s why:

  • There’s an order of magnitude better performance that can be gained by focusing on query/index/schema optimization.  The gains from sharding are usually much lower.
  • If you shard first, and then decide you want to tune query/index/schema to reduce server count, you find yourself in a more difficult position – since you have to apply your changes across all servers.

Or to phrase that another way:
I would never recommend sharding to a customer until I had at least reviewed their slow query log with mk-query-digest and understood exactly why each of the queries in that report was slow.  While we have some customers who have managed to create their own tools for shard automation, it’s always easier to propose major changes to how data is stored before you have a cluster of 50+ servers.
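To make the "review the slow queries first" step concrete, here is a minimal Python sketch of the general idea behind a query digest: group queries that differ only in their literal values, then rank the groups by total time spent. This is not mk-query-digest's implementation. The function names, the regular expressions, and the sample queries are all assumptions made up for illustration, and the normalization is far cruder than what a real tool does.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Collapse literals so queries differing only in values group together.

    Hypothetical, simplified normalization (not the real tool's rules).
    """
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)      # single-quoted strings
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)      # standalone numbers
    s = re.sub(r"\s+", " ", s)                    # runs of whitespace
    return s


def rank_by_total_time(samples):
    """samples: iterable of (sql, seconds) pairs.

    Returns (fingerprint, total_seconds, count) tuples, slowest group first.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for sql, seconds in samples:
        entry = totals[fingerprint(sql)]
        entry[0] += seconds
        entry[1] += 1
    rows = [(fp, t, n) for fp, (t, n) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data standing in for parsed slow-log entries.
samples = [
    ("SELECT * FROM users WHERE id = 42", 0.5),
    ("SELECT * FROM users WHERE id = 7", 1.5),
    ("SELECT name FROM orders WHERE status = 'open'", 0.25),
]
for fp, total, count in rank_by_total_time(samples):
    print(f"{total:8.2f}s  x{count}  {fp}")
```

The point of ranking by total time rather than per-query time is that a moderately slow query run very often can cost more than a single very slow one, and it is usually the first candidate for an index or schema change.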

About Morgan Tocker

Morgan is a former Percona employee.
He was the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

Comments

  1. It’s good to think about sharding from early on, that is – keep it in mind.
    Doing it early on tends to also be a bad idea because you don’t yet know where the nasties will be when things grow.

    For instance, originally independent components (good candidates for functional sharding!) may end up needing to be tightly integrated (joins) to deliver what users end up doing with your system.

  2. Michael says:

    I agree that sharding simply for the sake of it can create more problems and too much work early on when resources are scarce. However, there are at least three good reasons to shard early — or at least run your dev environments sharded, even if you have a single “shard” in production.

    1) It’s too tempting to produce joins or otherwise assume all data is in the same place, especially if you’re using an ORM layer (developers tend not to look at the resulting queries too closely). This makes it difficult to shard later (or vertically partition), and almost impossible to do it quickly.

    2) The tools you use for data access layer may not handle sharding correctly, or may need significant changes, or may result in different idioms. This also makes it difficult to shard later, and will involve significant retraining.

    3) You may expect significant traffic bumps as your company gains publicity. Forget the cost for a moment; it’s simply less work to add DB servers into the mix than to upgrade to one with more RAM/CPU/disk or whatever it is you need.

    But, truthfully, the best investment a new company can make is hire smart people with experience in both scaling systems and optimizing performance. :)

  3. Mark Callaghan says:

    Several things can delay the need to shard: affordable RAM, affordable flash storage, InnoDB plugin or XtraDB and smart people including expert consultants. Hopefully all of these are given proper consideration. Sharding is usually much easier than re-sharding (splitting) data on an overloaded shard. The plan to shard must include a plan to reshard.

  4. Your first bullet point is mostly true… definitely nothing replaces good data design with effective use of indexes. However, there are some reasons why you want to shard (or at least build it) from the start. Your second bullet point is true, but misses the advantage. Having multiple servers allows you to experiment with different options, and if done properly, lets you choose the best optimization sooner. Depending on how the sharding (or combined with replication) is done, you may also be able to create an index (which locks a table) but not face bringing the application down to do it.

    One more point that is somewhat overlooked is stress. It is much easier to think about and plan out a good sharding model when you do not have to deal with an application that has performance problems, all while the boss wants to add more features. If you do leave sharding for later, make sure everyone on the team knows about the potential technical debt, and that it is accounted for. You don’t want to have to rush into sharding under pressure/stress.

  5. Michael and Arjen: You raise good points about not introducing functionality that could block sharding if you’ve established you need to shard at some point. That can be a case where prolonging pain causes more pain – and I’ve certainly seen that happen.

    My main message here is really in response to a repeating pattern on systems I’ve had to work on. If you are using a pretty standard box with 1-2 spindles and 4-8G of RAM, there are a lot more opportunities open to you before sharding. Most of the items in Mark’s list (comment #3) were not available 2-3 years ago.

  6. Peter – Yes, having multiple machines certainly is helpful in A/B testing changes, but at the simplest level tuning queries with mk-query-digest is fairly safe unless you go too far with indexes on a write-heavy workload.

    There’s nothing wrong with planning to shard early – just hold off implementing it while you can. As Arjen wrote, users can surprise you – the components you thought wouldn’t grow exponentially do.
