## Can Shard-Query scale to 20 nodes?

Peter asked this question in the comments on my previous Shard-Query benchmark. Actually, he asked if it could scale to 50, but testing 20 was all I could do given EC2 and time limits. I think the results at 20 nodes are very useful for understanding the performance:

I will shortly release another blog post focusing on ICE performance at 20 nodes, but before that, I want to give a quick preview, then explain exactly how Shard-Query works.

Yes, Shard-Query scales very well at 20 nodes

## Distributed set processing (theory)

### What is SQL?

As you probably know, SQL stands for “structured query language”. It isn’t so much the language that is structured, but rather the data. Every SQL statement breaks down into a relational algebra expression. In algebra you learn that some operations are “distributable”: you can split them up into smaller units, and when you put those units back together the result is the same as if you had never taken them apart. Projection, GROUP BY, SUM, COUNT, MIN*, and MAX* are distributable. With a little elbow grease, all non-distributable aggregate functions (AVG, STDDEV, VARIANCE, etc.) can be decomposed into distributable functions using simple substitution rules.

So, to recap, every SQL query is really a cleverly disguised relational algebra mathematical expression. With relational algebraic substitution, every aggregate expression, even non-distributable ones, can be broken into distributable sub-expressions.
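The substitution rules above can be sketched in a few lines. This is a minimal illustration (the shard contents are made-up sample data, not from the benchmark) of decomposing the non-distributable AVG into the distributable SUM and COUNT:

```python
# Decomposing AVG into distributable partial aggregates: AVG = SUM / COUNT.
shards = [
    [10, 20, 30],  # rows on node 1
    [40, 50],      # rows on node 2
    [60],          # rows on node 3
]

# Each node computes only distributable partial aggregates over its own rows.
partials = [(sum(rows), len(rows)) for rows in shards]

# The coordinator combines the partials: AVG = SUM(sums) / SUM(counts).
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
distributed_avg = total_sum / total_count

# Same answer as computing AVG over all rows in one place.
all_rows = [r for rows in shards for r in rows]
assert distributed_avg == sum(all_rows) / len(all_rows)
print(distributed_avg)  # 35.0
```

The same trick works for STDDEV and VARIANCE by also shipping the sum of squares from each node.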

### What is a result set?

This isn’t really a trick question. The “result set” is a SET created by the output of a relational algebra expression; relational algebra expressions always produce sets as output. Just to drive it home: SQL is relational algebra, and this algebra operates only on sets. The next important thing to understand about SQL is that it is declarative: you tell the database what you want, but not how to get it. Most distributed engines work with rows. They break queries up into the lowest-level sets possible (rows), which doesn’t make much sense to me, since SQL is set oriented. I just said repeatedly that SQL doesn’t work on rows! Why break this paradigm by passing around rows instead of sets? From a distributed processing standpoint, rows are the worst case for performance; optimal mathematical performance requires operations on reduced sets. Row-based systems work well, but they are much farther from optimal than working directly with relational algebra.

### Materialized view techniques applied to distributed computation

I maintain another open source tool called Flexviews which supports incrementally maintaining materialized views. While writing Flexviews, I learned how to distribute the computation of aggregation queries over time. I realised that I could apply these same mathematical concepts to distributed queries as well, but with some minor differences.

Having written Flexviews, I understood that there are special materialized view optimizations that can be applied to INSERT-only workloads, and there are other optimizations that can be applied to views based on only a single base table. Knowing this, the query result set is treated as a materialized view over a union of all the already joined and aggregated data from all the nodes. A single temporary table is used to store the results from all the nodes. Since we are projecting results, this is naturally an INSERT-only workload. The insertions into the base table from each node correspond logically to records in a Flexviews materialized view delta table, and thus the logic for applying changes via ON DUPLICATE KEY UPDATE is the same. All of the fastest incremental materialized view optimizations can be applied.

### Shard-Query works only on sets

Shard-Query takes relational algebra to its logical conclusion, splitting a problem up into many small problems and then putting the results back together again. This set logic allows multiple levels of reduction of the set, all in parallel. All joins, aggregation, and filtering are done at the storage node level. This means that Shard-Query does not need to know anything about the structure of the data on which your queries operate. It can, however, use a mapper for partition elimination. The mapper is pluggable.

On each storage node (or a mapped subset of nodes) the result set is aggregated using distributable aggregate functions. The results from this aggregation get a second distributed reduce using UPSERTs into the coordination node when maintaining the “materialized view” of the results. Finally, a single threaded final group-by reduction is run over the coordination node, and this projects the final result set.
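The three stages above can be simulated end to end. This is a sketch using SQLite in place of the MySQL nodes (the table names and data are hypothetical; SQLite spells MySQL's ON DUPLICATE KEY UPDATE as ON CONFLICT ... DO UPDATE):

```python
import sqlite3

# Stage 1: each storage node runs the aggregation locally (GROUP BY + SUM/COUNT).
node_rows = {
    1: [("books", 10), ("toys", 5)],
    2: [("books", 7), ("games", 2)],
}
partials = []
for node, rows in node_rows.items():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE fact (category TEXT, amount INT)")
    db.executemany("INSERT INTO fact VALUES (?, ?)", rows)
    partials += db.execute(
        "SELECT category, SUM(amount), COUNT(*) FROM fact GROUP BY category"
    ).fetchall()

# Stage 2: the coordination node UPSERTs each partial row into the
# "materialized view" of the result; rows for the same key merge additively.
coord = sqlite3.connect(":memory:")
coord.execute("CREATE TABLE result (category TEXT PRIMARY KEY, s INT, c INT)")
coord.executemany(
    """INSERT INTO result VALUES (?, ?, ?)
       ON CONFLICT(category) DO UPDATE
       SET s = s + excluded.s, c = c + excluded.c""",
    partials,
)

# Stage 3: a single-threaded final reduction projects the result set.
final = coord.execute(
    "SELECT category, s, c FROM result ORDER BY category"
).fetchall()
print(final)  # [('books', 17, 2), ('games', 2, 1), ('toys', 5, 1)]
```

Because the UPSERT merge is additive, the partials can arrive from the nodes in any order without locking.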

### Inter-node communication

One or more gearman queues are used to distribute the work, and computation is massively parallel. Shard-Query simply waits for all the queries to finish, then projects the synchronously maintained, incrementally refreshable materialized view that represents the output. No external locking or synchronization is required. If a query fails on a storage node, it can be retried on the same node or, in the future, on a redundant node.

## Work on problems of any size

### Set processing is massively parallel

In fact, it is embarrassingly parallel. Because Shard-Query works on sets and features pluggable partition mapping, it allows partitioning resources to any depth and distributing a query over that set of resources. Let’s say you have a database system that is at capacity in data center A, and there is not enough power to add new nodes there. You can add new nodes in data center B and distribute the queries over both data centers. The response time should increase only by the average latency, since work is done in parallel and there is very little data shipping because of the distributed reduction discussed above. If you hit any resource limitation in a cluster (CPU, memory, disk, power, etc.), split the problem up into more chunks.

Each Shard-Query instance is essentially a proxy for the distributed set-based SQL operations executed on the nodes under the proxy. The databases do 99.9% of the work, thanks to the multiple levels of result set reduction. Finally, Shard-Query can automatically partition sets into subsets using the basic relational algebra substitution rules documented in this blog post. This allows BETWEEN, IN, subqueries in the FROM clause, and UNION operations to operate fully in parallel.
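The BETWEEN case can be demonstrated in a few lines. This is a toy sketch (the key range and data are invented) showing that splitting a BETWEEN predicate into disjoint sub-ranges and running them in parallel yields the same result as the original query:

```python
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1, 1001))  # stand-in for a fact table keyed by an integer column

def run_subquery(lo, hi):
    # Each worker evaluates "WHERE k BETWEEN lo AND hi" over its sub-range.
    return sum(k for k in rows if lo <= k <= hi)

# Rewrite: WHERE k BETWEEN 1 AND 300  =>  three disjoint sub-ranges.
sub_ranges = [(1, 100), (101, 200), (201, 300)]
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(lambda r: run_subquery(*r), sub_ranges))

# The union of the disjoint partials equals the original query's result.
assert sum(partial_sums) == sum(k for k in rows if 1 <= k <= 300)
print(sum(partial_sums))  # 45150
```

Because the sub-ranges are disjoint, no de-duplication is needed when the partials are combined; the same property makes IN lists and UNION branches safe to parallelize.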

### Distributed set processing is database agnostic

Keep in mind, at each level of partitioning only a list of servers and some credentials are required to make this work. It can even be set up without the aid of your database administrator.

As long as all of the nodes share a common schema model and a common SQL dialect, they can all participate in the distributed query processing. There is almost no data shipping, as only aggregated sets are sent over the network. In the future you will be able to distribute work over any type of compute resource that speaks SQL, but right now only MySQL storage nodes are supported. Amdahl’s law applies to the distributed processing: the slowest node sets a floor on the overall response time.

Soon you will be able to set up computation over ICE, InnoDB, Vectorwise, and other databases, all at the same time and transparently to the source query. Adding a new storage node SQL dialect is almost trivial. I’ll document that process shortly; I think I need to make some minor abstraction modifications in the data access layer.

### Shard-Query can provide query execution plans based on the relational algebra rewrites

So, here is an example of such a plan for a simple query with aggregation, a JOIN, and a WHERE clause that uses an IN clause:
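As an illustrative sketch (the table and column names here are invented, not taken from an actual run), such a rewrite could look like this:

```sql
-- Hypothetical source query:
SELECT d.name, AVG(f.amount)
  FROM fact f
  JOIN dim d ON f.dim_id = d.id
 WHERE f.dim_id IN (1, 2, 3)
 GROUP BY d.name;

-- Rewrite: AVG is decomposed into SUM and COUNT, and the IN list is split
-- into one subquery per value, each sent to the mapped storage node(s):
SELECT d.name, SUM(f.amount) AS s, COUNT(f.amount) AS c
  FROM fact f
  JOIN dim d ON f.dim_id = d.id
 WHERE f.dim_id = 1
 GROUP BY d.name;
-- ...and the same query again with f.dim_id = 2 and f.dim_id = 3...

-- Coordination node: each partial result is UPSERTed into a temporary
-- result table, then a single-threaded final reduction projects the answer:
SELECT name, SUM(s) / SUM(c) AS avg_amount
  FROM result
 GROUP BY name;
```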


*MIN/MAX are only distributable in INSERT-only workloads

For more information about maintaining materialized views, see Flexviews.

## Comments

Good job, good start! Taking your work and idea (with ICE) to another level with distributed GlusterFS (HA and rebalancing) and MPI over InfiniBand, it will get embarrassingly (parallel) fast.

Justin, of course, there is a large class of OLAP applications which can be reimplemented using the Shard-Query framework. This is a potentially great approach to building low-cost clustered analytical databases. There are some limitations (like no support for constellation-type schemas), but many companies can live with them. Just one comment: Shard-Query is quite difficult to integrate into a regular enterprise application, where PHP is a third-class citizen at best (frankly speaking, I have never heard of PHP being used in enterprise applications).

Sharding with value ranges is essentially a distributed version of B+Trees and has the same moderators on concurrency. Massively parallel databases that support both hash partitioning and range partitioning always warn that the latter will limit scale-out. It works okay at modest levels of distribution; if the database is read-only then this obviously does not matter much outside of load time.

As you point out, you can materialize joins by sharding the hierarchy at each level so that locality is preserved. That is roughly what a document database does and some commercial databases (e.g. Oracle) have supported this optimization since the 1990s. The problem is that this structure makes queries that do not strictly adhere to that hierarchical traversal very inefficient. Obviously this works for some cases (see: MongoDB).

The subtle point on joins is that join algorithms are embarrassingly parallel on some network topologies, just not on a switch fabric topology. Even if the operations shard, the network won’t. It is the reason that graph databases, which are built on selection and equi-join operators, are not usefully shardable even though they seem like they should be.

What you are doing will work up to a point; your technique has a lot of history. It is how parallel databases like Teradata work under the hood. I’m not trying to give you a hard time, but the devil is in the details when it comes to MPP database engines. The first lesson everyone (myself included) learns the hard way, because it is not intuitive, is that some embarrassingly parallel algorithms are not parallel on a switch fabric. This turns out to be a major limitation (not surprisingly, very familiar to the HPC software community).

Your technique can efficiently scale anything that will efficiently scale with MapReduce so there is clear value. I prefer your way to be honest.

Justin,

As far as I understood, all dimension tables in your test are replicated among all nodes. Otherwise, joining the (partitioned) fact table with another partitioned table would require a totally different technique, which would be hard to implement in Shard-Query. Am I right?

“With relational algebraic substitution, every aggregate expression, even non-distributable ones, can be broken into distributable sub-expressions”

This is true only for analytical aggregates and does not hold for holistic ones (e.g. median, distinct counts, percentiles).

You are correct, strictly speaking. Join operators are set oriented but are derived more from category theory (of sets) than set theory per se. They define sets of relationships rather than sets of entities.

To be clear, join algorithms can be sharded and in one uselessly narrow case they will even scale linearly. However, most sharded joins built from set algebras and logic have a strong sublinear scaling character even though they shard easily and, as you note above, self-joins are so sublinear that there is no value in sharding the algorithms at all. I was speaking more to practical limitations on sharding joins.

(There are unpublished algorithms that will shard a self-join with linear scalability — I’ve seen two — but those relational algebras are built from topology theory rather than the more obvious set theory.)

While I agree that database operations can (and should) be decomposed into an elegantly parallelizable relational algebra, the major limitation is that relational join operations are not efficiently distributable which limits generality of sharding approaches to databases.

Join algorithms are canonical examples of the difference between data parallel algorithms (i.e. algorithms that only scale on a network where every node is directly connected to every other node) and disjoint-access parallel algorithms (i.e. algorithms that scale on a network where every node is connected to a non-blocking switch fabric). Standard join algorithms only distribute efficiently in the directly connected topology model but all of our clusters tend to be switch fabrics, hence why joins are a perennial scaling problem for parallel databases.