Demonstrating distributed set processing performance
Shard-Query + ICE scales very well up to at least 20 nodes
This post is a detailed performance analysis of what I’ve coined “distributed set processing”.
Please also read this post’s “sister post” which describes the distributed set processing technique.
Also, remember that Percona can help you get up and running using these tools in no time flat. We are the only ones with experience with them.
20 is the maximum number of nodes that I am allowed to create in EC2. I can test further on our 32 core system, but I wanted to do a real world “cloud” test to show that this works over the network, in a real world environment. There is a slight performance oddity at 16 nodes. I suspect the EC2 environment is the reason for this, but it requires further investigation.
Next I compared the performance of 20 m1.large machines cold, versus hot. I did not record the cold results on the c1.medium machines, so only the warm results are provided for reference. Remember that the raw input data set was 55GB before converting to a star schema (21GB) and being compressed by ICE to 2.5GB. Many of these queries examine the entire data set doing origin/count(distinct destination) combinations across two dimension (origin/dest), each with 400 unique items.
In the following chart you will see performance at a single node as the tall blue line, and the short cyan line is 20 nodes. In order to avoid too many bars on the chart, response times between 2 and 16 nodes (inclusive) are shown as lines.
Concurrency testing is important too. I tested a 20 node m1.large system at 1,2,4,8,16 and 32 threads of concurrency.
The following simple bash scripts were used for the concurrency test:
$ cat start_bench
if [ "$1" = "" ]
if [ "$2" = "" ]
for i in `seq 1 $workers`
echo "Launching benchmark worker #$i with $iterations iterations"
./bench_worker $iterations $workers &
$ cat bench_worker
for i in `seq 1 $1`
mkdir -p "results/$2/"
./run_query < queries.sql |egrep "rows|^-" > results/$2/raw.$$.$i.txt
Query processing is handled by a Gearman queue which limits the maximum number of concurrent storage node queries. This prevents the system from being overloading and provides a scalable average response time under increased concurrency. Another queue in front of the storage nodes is probably advisable.