Distributed Set Processing with Shard-Query

Can Shard-Query scale to 20 nodes?

Peter asked this question in the comments on my previous Shard-Query benchmark. Actually he asked if it could scale to 50, but testing 20 was all I could do given EC2 and time limits. I think the results at 20 nodes are very useful for understanding the performance:

I will shortly release another blog post focusing on ICE performance at 20 nodes, but before that, I want to give a quick preview, then explain exactly how Shard-Query works.
Query response time from 1 to 20 nodes

Yes, Shard-Query scales very well at 20 nodes

Distributed set processing (theory)

What is SQL?

As you probably know, SQL stands for “structured query language”. It isn’t so much the language that is structured, but actually the data. Every SQL statement breaks down into a relational algebra equation. In algebra you learn that some operations are “distributable”; that is, you can split them up into smaller units, and when you put those units back together the result is the same as if you had never taken them apart. Projection, GROUP BY, SUM, COUNT, MIN*, and MAX* are distributable. With a little elbow grease, all non-distributable aggregate functions (AVG, STDDEV, VARIANCE, etc.) can be decomposed into distributable functions using simple substitution rules.

So, to recap, every SQL query is really a cleverly disguised relational algebra mathematical expression. With relational algebraic substitution, every aggregate expression, even non-distributable ones, can be broken into distributable sub-expressions.
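As a toy illustration of that substitution (plain Python rather than SQL, with invented data), AVG is not itself distributable, but it can be rewritten as SUM/COUNT, both of which are: each shard computes its partial sum and count, and the coordinator combines them.

```python
# Hypothetical sketch: decomposing AVG into the distributable SUM and COUNT.

def shard_partials(rows):
    """Per-shard reduction: the distributable partials needed for AVG."""
    return {"sum": sum(rows), "count": len(rows)}

def combine_avg(partials):
    """Coordinator reduction: AVG = SUM(partial sums) / SUM(partial counts)."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count

# Two shards holding disjoint subsets of the same column:
shard_a = shard_partials([10, 20, 30])
shard_b = shard_partials([40, 50])
assert combine_avg([shard_a, shard_b]) == 30.0  # same as AVG over all five rows
```

Averaging the two per-shard averages (20 and 45) would give the wrong answer; combining the partial sums and counts gives the exact result, which is the whole point of the substitution.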

What is a result set?

This isn’t really a trick question. The “result set” is a SET created by the output of a relational algebra expression. Relational algebra expressions always produce sets as output. Just to drive it home: SQL is relational algebra, and this algebra operates only on sets. The next important thing to understand about SQL is that it is declarative. That is, you tell the database what you want, but not how to get it. Most distributed engines work with rows. They break queries up into the lowest-level sets possible (rows), which doesn’t make much sense to me, since SQL is set oriented. I just said repeatedly that SQL doesn’t work on rows! Why break this paradigm by passing around rows instead of sets? From a distributed processing standpoint, rows are the worst case for performance; optimal mathematical performance requires operations on reduced sets. Keep in mind that row-based systems work well, but they are much farther from optimal than working directly with relational algebra.

Materialized views techniques applied to distributed computation

I maintain another open source tool called Flexviews which supports incrementally maintaining materialized views. While writing Flexviews, I learned how to distribute the computation of aggregation queries over time. I realised that I could apply these same mathematical concepts to distributed queries as well, but with some minor differences.

Having written Flexviews, I understood that there are special materialized view optimizations that can be applied to INSERT-only workloads, and there are other optimizations that can be applied to views based on only a single base table. Knowing this, the query result set is treated as a materialized view over a union of the already joined and aggregated data from all the nodes. A single temporary table is used to store the results from all the nodes. Since we are projecting results, this is naturally an INSERT-only workload. The insertions into the base table from each node correspond logically to records in a Flexviews materialized view delta table, and thus the logic for applying changes via ON DUPLICATE KEY UPDATE is the same. All of the fastest incremental materialized view optimizations can be applied.
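A minimal sketch of that upsert-based merge, using SQLite's ON CONFLICT clause (SQLite ≥ 3.24) as a stand-in for MySQL's ON DUPLICATE KEY UPDATE; the table and column names are illustrative, not Shard-Query's actual schema. Each shard sends an already-aggregated row, and the coordinator upserts it into a temporary results table, merging on the GROUP BY key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (grp TEXT PRIMARY KEY, cnt INT, total INT)")

def merge_shard_row(grp, cnt, total):
    # First arrival for a group inserts; later arrivals from other shards
    # hit the primary key and are merged, exactly like a delta apply.
    conn.execute(
        """INSERT INTO results (grp, cnt, total) VALUES (?, ?, ?)
           ON CONFLICT(grp) DO UPDATE SET
             cnt = cnt + excluded.cnt,
             total = total + excluded.total""",
        (grp, cnt, total),
    )

# Partial aggregates for group 'x' arriving from two different shards:
merge_shard_row("x", 3, 100)
merge_shard_row("x", 2, 50)
row = conn.execute("SELECT cnt, total FROM results WHERE grp = 'x'").fetchone()
assert row == (5, 150)
```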

Shard-Query works only on sets

Shard-Query takes relational algebra to its logical conclusion, splitting a problem up into many small problems and then putting the results back together again. This set logic allows multiple levels of reduction of the set, all in parallel. All joins, aggregation, and filtering are done at the storage node level. This means that Shard-Query needs no knowledge of the structure of the data on which your queries operate. It can, however, use a mapper for partition elimination. The mapper is pluggable.

On each storage node (or a mapped subset of nodes), the result set is aggregated using distributable aggregate functions. The results of this aggregation then undergo a second distributed reduction via UPSERTs into the coordination node, which maintains the “materialized view” of the results. Finally, a single-threaded group-by reduction runs over the coordination node and projects the final result set.

Inter-node communication

One or more Gearman queues are used to distribute work. Computation is massively parallel. Shard-Query simply waits for all the queries to finish, then projects the synchronously maintained, incrementally refreshable materialized view that represents the output. There is no external locking or synchronization required. If a query fails on a storage node, it can be retried on the same node or, in the future, on a redundant node.
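The dispatch pattern can be sketched with a plain thread pool standing in for the Gearman queues (Shard-Query itself uses Gearman workers; the shard names and `run_on_shard` function here are hypothetical). Every shard gets the same rewritten query in parallel, a failed query is simply re-queued, and the caller just waits for all results:

```python
from concurrent.futures import ThreadPoolExecutor

attempts = {"shard2": 0}

def run_on_shard(shard, query):
    # Stand-in for a worker executing the query on one storage node.
    if shard == "shard2" and attempts["shard2"] == 0:
        attempts["shard2"] += 1
        raise ConnectionError("transient failure")  # fails once, then succeeds
    return f"partials:{shard}"

def run_with_retry(shard, query, retries=1):
    # A failed query is retried on the same node (or, in the future,
    # on a redundant node).
    for attempt in range(retries + 1):
        try:
            return run_on_shard(shard, query)
        except ConnectionError:
            if attempt == retries:
                raise

shards = ["shard1", "shard2", "shard3"]
with ThreadPoolExecutor() as pool:
    # Fire the query at every shard in parallel, then simply wait for
    # all of them to finish before the final merge.
    results = list(pool.map(lambda s: run_with_retry(s, "SELECT ..."), shards))
assert results == ["partials:shard1", "partials:shard2", "partials:shard3"]
```

No locking is needed in the merge step because the coordinator's upsert table absorbs results in any arrival order.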

Work on problems of any size.

Set processing is massively parallel. In fact, it is embarrassingly parallel. Because Shard-Query works on sets and features pluggable partition mapping, it can partition resources to any depth and distribute a query over that set of resources. Let’s say you have a database system that is at capacity in data center A, and there is not enough power to add new nodes there. You can add new nodes in data center B and distribute queries over both data centers. Response time should increase only by the average network latency, since work is done in parallel and there is very little data shipping because of the distributed reduction discussed above. If you hit any resource limitation in a cluster (CPU, memory, disk, power, etc.), split the problem up into more chunks.

Each Shard-Query instance is essentially a proxy for the distributed set-based SQL operations executed on the nodes under the proxy. The databases do 99.9% of the work, due to the multiple levels of result set reduction. Finally, Shard-Query can automatically partition sets into subsets using basic relational algebra substitution rules that are documented in this blog post. This allows BETWEEN, IN, subqueries in the FROM clause, and UNION operations to operate fully in parallel.
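A toy version of one of those substitution rules: a query whose WHERE clause uses IN can be split into one independent query per list value, each of which can run on a different worker in parallel. The query text and function name are illustrative, not Shard-Query's actual rewriter output.

```python
# Hypothetical rewrite rule: `col IN (v1, v2, ...)` becomes one
# equality query per value, all of which can execute in parallel.

def split_in_clause(template, column, values):
    """Rewrite an IN predicate into one single-value query per element."""
    return [template.format(cond=f"{column} = {v!r}") for v in values]

queries = split_in_clause(
    "SELECT region, SUM(sales) FROM facts WHERE {cond} GROUP BY region",
    "year",
    [2009, 2010, 2011],
)
assert len(queries) == 3
assert queries[0] == (
    "SELECT region, SUM(sales) FROM facts WHERE year = 2009 GROUP BY region"
)
```

Because the IN values are disjoint, the partial result sets are disjoint too, and the coordinator's upsert merge reassembles them into the same answer the original query would produce.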

Distributed set processing is database agnostic.

Keep in mind that at each level of partitioning, only a list of servers and some credentials are required to make this work. It can even be set up without the aid of your database administrator.

As long as all of the nodes share a common schema model and a common SQL dialect, they can all participate in the distributed query processing. There is almost no data shipping, as only aggregated sets are sent over the network. In the future you will be able to distribute work over any type of compute resource which speaks SQL, but right now only MySQL storage nodes are supported. Amdahl’s law applies to the distributed processing: the slowest node places a lower bound on overall query response time.

Soon you will be able to set up computation over ICE, InnoDB, Vectorwise, and other databases, all at the same time, and transparently to the source query. Adding a new storage node SQL dialect is almost trivial. I’ll document that process shortly; I think I need to make some minor abstraction modifications in the data access layer first.

Shard-Query can provide query execution plans based on the relational algebra rewrites

So, here is an example of such a plan for a simple query with aggregation, a JOIN, and a WHERE clause that uses an IN predicate:
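The plan from the original post is shown as an image; as a hedged sketch of the same idea (table and column names invented for illustration, not Shard-Query's actual output format), the rewrite works roughly like this:

```sql
-- Source query: aggregation, a JOIN, and an IN predicate
SELECT d.region, AVG(f.amount)
FROM fact f JOIN dim d ON f.dim_id = d.id
WHERE f.year IN (2009, 2010)
GROUP BY d.region;

-- Per-shard rewrite: AVG decomposed into SUM/COUNT, and one query per
-- IN value, sent to every shard in parallel
SELECT d.region, SUM(f.amount) AS s, COUNT(f.amount) AS c
FROM fact f JOIN dim d ON f.dim_id = d.id
WHERE f.year = 2009
GROUP BY d.region;
-- ...plus the same query with f.year = 2010...

-- Final single-threaded pass over the coordinator's temporary table
-- of upsert-merged partials
SELECT region, SUM(s) / SUM(c) AS avg_amount
FROM coord_tmp
GROUP BY region;
```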


*MIN/MAX are only distributable in INSERT-only workloads

See this information about maintaining materialized views:

http://scholar.google.com/scholar?hl=en&lr=&q=related:Y6OKUW79CGQJ:scholar.google.com


Comments (27)

  • Justin Swanhart

    Vlad,

    Eventually a proxy that fronts the framework would allow enterprise apps to interact with it as if it were a regular MySQL database server. This will allow Mondrian, etc, to work with it transparently as well.

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    Thanks Hung, let us know about your results!

    May 14, 2011 at 12:00 am
  • Hung Phan

    Good job, good start! Taking your work and idea (with ICE) to another level with the distributed GlusterFS (HA and rebalancing) and MPI on Infiniband. It will get embarrassingly (parallel) fast :-)

    May 14, 2011 at 12:00 am
  • Vlad

    Justin, of course, there is a large class of OLAP applications which can be reimplemented using Shard-Query framework. This is a potentially great approach to building low-cost clustered analytical databases. There are some limitations (like no support for constellation type of schema) but many companies can live with these limitations. Just one comment: Shard-Query is quite difficult to integrate into regular enterprise application where PHP is a third class citizen in a best case (frankly speaking, I have never heard of PHP being used in an enterprise applications).

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    Regardless, Shard-Query suits the class of problems that I want to target. I want to be able to provide fast SQL based analytics on a star schema for ROLAP based processing while spreading the work over whatever database technology is available. In the case where a dimension would be too long, simply make it a degenerate dimension, or fully denormalize for ultimate performance.

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    There are more efficient optimizations for the non-distributable aggregates, but one can only work on so many things before real-world testing.

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    Pushing into the group by would normally be a problem, but the incremental maintenance of the #tmp table addresses that issue (the UPSERT).

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    percentile can use GROUP_CONCAT on mysql. Flexviews has native support for asynchronous maintenance of materialized views with all aggregate functions + count(distinct) and Shard-Query can distribute count(distinct) over a cluster. My benchmark tests this extensively if you take a look at the queries.

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    COUNT(distinct), percentile and median can be decomposed into distributable forms.

    MEAN can be turned into SUM/COUNT
    STDDEV, etc can be pushed into the GROUP BY at the shard level, and then the collected set from all the shards has a valid list of distinct values to evaluate.

    May 14, 2011 at 12:00 am
  • J. Andrew Rogers

    Sharding with value ranges is essentially a distributed version of B+Trees and has the same moderators on concurrency. Massively parallel databases that support both hash partitioning and range partitioning always warn that the latter will limit scale-out. It works okay at modest levels of distribution; if the database is read-only then this obviously does not matter much outside of load time.

    As you point out, you can materialize joins by sharding the hierarchy at each level so that locality is preserved. That is roughly what a document database does and some commercial databases (e.g. Oracle) have supported this optimization since the 1990s. The problem is that this structure makes queries that do not strictly adhere to that hierarchical traversal very inefficient. Obviously this works for some cases (see: MongoDB).

    The subtle point on joins is that join algorithms are embarrassingly parallel on some network topologies, just not on a switch fabric topology. Even if the operations shard, the network won’t. It is the reason that graph databases, which are built on selection and equi-join operators, are not usefully shardable even though they seem like they should be.

    What you are doing will work up to a point, your technique has a lot of history. It is how parallel databases like Teradata work under the hood. I’m not trying to give you a hard time but the devil is in the details when it comes to MPP database engines. The first lesson everyone (myself included) learns the hard way because it is not intuitive is that some embarrassingly parallel algorithms are not parallel on a switch fabric. This turns out to be a major limitation (not surprisingly, very familiar to the HPC software community).

    Your technique can efficiently scale anything that will efficiently scale with MapReduce so there is clear value. I prefer your way to be honest.

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    Flexviews allows you to efficiently maintain the aggregate tables over time based only on what changed in the database. A sufficiently sharded database will have a small amount of data in each node, and you can make sure this is so by using multiple tiers.

    If you want to replicate data between nodes, Tungsten offers external replication features for most databases.

    May 14, 2011 at 12:00 am
  • Justin Swanhart

    Vlad, yes. You can use map-reduce on top of this if you want to, or you can just fully denormalize. All the “dimensions” will just end up being built as aggregate tables and mondrian can use those transparently anyway.

    May 14, 2011 at 12:00 am
  • Vlad

    Justin,

    As far as I understood, all dimension tables in your test are replicated among all nodes. Otherwise, joining fact table (partitioned) with another partitioned table will require totally different technique which is hardly implemented in Shard-Query. Am I right?

    May 14, 2011 at 12:00 am
  • Vlad

    “With relational algebraic substitution, every aggregate expression, even non-distributable ones, can be broken into distributable sub-expressions”

    This is true only for analytical aggregates and does not hold on holistic ones (i.e. median, distinct, percentiles).

    May 14, 2011 at 12:00 am