Infobright and InnoDB AMI images are now available
There are now demonstration AMI images for Shard-Query. Each image comes pre-loaded with the data used in the previous Shard-Query blog post. The data in each image is split into 20 “shards”. This post will refer to an EC2 instance as a “node” from here on. Shard-Query is very flexible in its configuration, so you can use this sample database to spread processing over up to 20 nodes.
The Infobright Community Edition (ICE) images are available in 32-bit and 64-bit varieties. Due to memory requirements, the InnoDB versions are only available on 64-bit instances. MySQL will fail to start on a micro instance; if you really want to try micro instances, decrease the memory-related values in /etc/my.cnf first.
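For example, something like the following should shrink the buffer pool enough to start (the 128M value is my own guess, untested on micro instances; the setting shipped in the image is 5600M, as shown in the my.cnf at the end of this post):

# Hypothetical: shrink the InnoDB buffer pool so mysqld can start on a micro instance
sudo sed -i 's/^innodb-buffer-pool-size=.*/innodb-buffer-pool-size=128M/' /etc/my.cnf
# then restart MySQL (the exact service name on the AMI may differ)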
*EDIT*
The storage worker currently logs too much information, which can fill the disk with logs. You can fix this by modifying shard-query/run_worker to contain the following:
#!/bin/bash
while [ 1 ]
do
  ./worker >> /dev/null 2>&1 < /dev/null
done;
Where to find the images
Amazon ID | Name | Arch | Notes
---|---|---|---
ami-20b74949 | shard-query-infobright-demo-64bit | x86_64 | ICE 3.5.2pl1. Requires m1.large or larger.
ami-8eb648e7 | shard-query-innodb-demo-64bit | x86_64 | Percona Server 5.5.11 with XtraDB. Requires m1.large or larger.
ami-f65ea19f | shard-query-infobright-demo | i686 | ICE 3.5.2pl1 (32-bit). Requires m1.small or larger.
snap-073b6e68 | shard-query-demo-data-flatfiles | 30GB ext3 EBS snapshot | An ext3 volume containing the flat files for the demos, in case you want to reload into your favorite storage engine or database.
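If you prefer the command line to the console, instances can be launched from one of these images like so (shown with the AWS CLI; the key-pair name is a placeholder, and the console works just as well):

# Launch 20 of the 64-bit InnoDB demo nodes (adjust --count and --image-id to taste)
aws ec2 run-instances --image-id ami-8eb648e7 --instance-type m1.large \
    --count 20 --key-name my-keypair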
About the cluster
For best performance, data should be distributed evenly across the system. To get an even distribution, the test data was hash-partitioned on the values in the date_id column. There will be another blog post about the usage and performance of the splitter. It is multi-threaded (actually multi-process) and can hash-split up to 50GB/hour of input data on my i970 test machine. It is also possible to distribute splitting and/or loading among multiple nodes. Note that in the demonstration, for any configuration of more than one node, each node will contain redundant but never-accessed data. This would not be the case in normal circumstances. The extra data will not impact performance because it is never accessed.
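As an illustration of the idea (this is not the actual splitter, which is multi-process and much faster), hash-partitioning a delimited flat file on a column boils down to something like the following, assuming date_id is the first comma-separated field:

# Toy hash-splitter: route each row to one of 20 shard files by date_id (field 1).
# awk truncates each output file on first open, then appends for the rest of the run.
awk -F',' '{ print > ("shard" (($1 % 20) + 1) ".csv") }' ontime.csv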
Since both InnoDB and ICE versions of the data are available, it is worth examining the differences in size. This gives us some interesting information about how Shard-Query will perform on each database. For the size comparison, I used the du utility:
InnoDB file size on disk: 42GB (with indexes)
# du -sh *
203M    ibdata1
128M    ib_logfile0
128M    ib_logfile1
988K    mysql
2.1G    ontime1
2.1G    ontime10
2.1G    ontime11
2.1G    ontime12
2.1G    ontime13
2.1G    ontime14
2.1G    ontime15
2.1G    ontime16
2.1G    ontime17
2.1G    ontime18
2.1G    ontime19
2.1G    ontime2
2.1G    ontime20
2.1G    ontime3
2.1G    ontime4
2.1G    ontime5
2.1G    ontime6
2.1G    ontime7
2.1G    ontime8
2.1G    ontime9
212K    performance_schema
0       test
ICE size on disk: 2.5GB
# du -sh *
8.0K    bh.err
11M     BH_RSI_Repository
4.0K    brighthouse.ini
4.0K    brighthouse.log
4.0K    brighthouse.seq
964K    mysql
123M    ontime1
124M    ontime10
123M    ontime11
123M    ontime12
123M    ontime13
123M    ontime14
123M    ontime15
123M    ontime16
123M    ontime17
123M    ontime18
124M    ontime19
124M    ontime2
124M    ontime20
124M    ontime3
123M    ontime4
122M    ontime5
122M    ontime6
122M    ontime7
123M    ontime8
125M    ontime9
The InnoDB data directory size is 42GB, which is twice the original size of the input data. The ICE schema was discussed in the comments of the last post. ICE does not have any indexes (not even primary keys).
Here is the complete InnoDB schema from one shard. The schema is duplicated 20 times (though the ontime_fact data is not):
DROP TABLE IF EXISTS `dim_airport`;
CREATE TABLE `dim_airport` (
  `airport_id` int(11) NOT NULL DEFAULT '0',
  `airport_code` char(3) DEFAULT NULL,
  `CityName` varchar(100) DEFAULT NULL,
  `State` char(2) DEFAULT NULL,
  `StateFips` varchar(10) DEFAULT NULL,
  `StateName` varchar(50) NOT NULL,
  `Wac` int(11) DEFAULT NULL,
  PRIMARY KEY (`airport_id`),
  KEY `CityName` (`CityName`),
  KEY `State` (`State`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Data from BTS ontime flight data. Data for Origin and Destination airport data.';

CREATE TABLE `dim_date` (
  `Year` year(4) DEFAULT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date NOT NULL,
  `date_id` smallint(6) NOT NULL,
  PRIMARY KEY (`date_id`),
  KEY `FlightDate` (`FlightDate`),
  KEY `Year` (`Year`,`Quarter`,`Month`,`DayOfWeek`),
  KEY `Quarter` (`Quarter`,`Month`,`DayOfWeek`),
  KEY `Month` (`Month`,`DayOfWeek`),
  KEY `DayOfWeek` (`DayOfWeek`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains the date information from the BTS ontime flight data. Note dates may not be in date_id order';
/*!40101 SET character_set_client = @saved_cs_client */;

CREATE TABLE `dim_flight` (
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `FlightNum` varchar(10) DEFAULT NULL,
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `AirlineName` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`flight_id`),
  KEY `UniqueCarrier` (`UniqueCarrier`,`AirlineID`,`Carrier`),
  KEY `AirlineID` (`AirlineID`,`Carrier`),
  KEY `Carrier` (`Carrier`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains information on flights, and what airline offered those flights and the flight number of the flight. Some data hand updated.';

--
-- Table structure for table `ontime_fact`
--

CREATE TABLE `ontime_fact` (
  `date_id` int(11) NOT NULL DEFAULT '0',
  `origin_airport_id` int(11) NOT NULL DEFAULT '0',
  `dest_airport_id` int(11) NOT NULL DEFAULT '0',
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `TailNum` varchar(50) DEFAULT NULL,
  `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL,
  `DepDelay` int(11) DEFAULT NULL,
  `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL,
  `DepartureDelayGroups` int(11) DEFAULT NULL,
  `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL,
  `WheelsOff` int(11) DEFAULT NULL,
  `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL,
  `CRSArrTime` int(11) DEFAULT NULL,
  `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL,
  `ArrDelayMinutes` int(11) DEFAULT NULL,
  `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL,
  `ArrTimeBlk` varchar(20) DEFAULT NULL,
  `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL,
  `Diverted` tinyint(4) DEFAULT NULL,
  `CRSElapsedTime` int(11) DEFAULT NULL,
  `ActualElapsedTime` int(11) DEFAULT NULL,
  `AirTime` int(11) DEFAULT NULL,
  `Flights` int(11) DEFAULT NULL,
  `Distance` int(11) DEFAULT NULL,
  `DistanceGroup` tinyint(4) DEFAULT NULL,
  `CarrierDelay` int(11) DEFAULT NULL,
  `WeatherDelay` int(11) DEFAULT NULL,
  `NASDelay` int(11) DEFAULT NULL,
  `SecurityDelay` int(11) DEFAULT NULL,
  `LateAircraftDelay` int(11) DEFAULT NULL,
  `FirstDepTime` varchar(10) DEFAULT NULL,
  `TotalAddGTime` varchar(10) DEFAULT NULL,
  `LongestAddGTime` varchar(10) DEFAULT NULL,
  `DivAirportLandings` varchar(10) DEFAULT NULL,
  `DivReachedDest` varchar(10) DEFAULT NULL,
  `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  `DivArrDelay` varchar(10) DEFAULT NULL,
  `DivDistance` varchar(10) DEFAULT NULL,
  `Div1Airport` varchar(10) DEFAULT NULL,
  `Div1WheelsOn` varchar(10) DEFAULT NULL,
  `Div1TotalGTime` varchar(10) DEFAULT NULL,
  `Div1LongestGTime` varchar(10) DEFAULT NULL,
  `Div1WheelsOff` varchar(10) DEFAULT NULL,
  `Div1TailNum` varchar(10) DEFAULT NULL,
  `Div2Airport` varchar(10) DEFAULT NULL,
  `Div2WheelsOn` varchar(10) DEFAULT NULL,
  `Div2TotalGTime` varchar(10) DEFAULT NULL,
  `Div2LongestGTime` varchar(10) DEFAULT NULL,
  `Div2WheelsOff` varchar(10) DEFAULT NULL,
  `Div2TailNum` varchar(10) DEFAULT NULL,
  `Div3Airport` varchar(10) DEFAULT NULL,
  `Div3WheelsOn` varchar(10) DEFAULT NULL,
  `Div3TotalGTime` varchar(10) DEFAULT NULL,
  `Div3LongestGTime` varchar(10) DEFAULT NULL,
  `Div3WheelsOff` varchar(10) DEFAULT NULL,
  `Div3TailNum` varchar(10) DEFAULT NULL,
  `Div4Airport` varchar(10) DEFAULT NULL,
  `Div4WheelsOn` varchar(10) DEFAULT NULL,
  `Div4TotalGTime` varchar(10) DEFAULT NULL,
  `Div4LongestGTime` varchar(10) DEFAULT NULL,
  `Div4WheelsOff` varchar(10) DEFAULT NULL,
  `Div4TailNum` varchar(10) DEFAULT NULL,
  `Div5Airport` varchar(10) DEFAULT NULL,
  `Div5WheelsOn` varchar(10) DEFAULT NULL,
  `Div5TotalGTime` varchar(10) DEFAULT NULL,
  `Div5LongestGTime` varchar(10) DEFAULT NULL,
  `Div5WheelsOff` varchar(10) DEFAULT NULL,
  `Div5TailNum` varchar(10) DEFAULT NULL,
  KEY `date_id` (`date_id`),
  KEY `flight_id` (`flight_id`),
  KEY `origin_airport_id` (`origin_airport_id`),
  KEY `dest_airport_id` (`dest_airport_id`),
  KEY `DepDelay` (`DepDelay`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains all available data from 1988 to 2010';

mysql> use ontime1;
Database changed
mysql> show table status like 'ontime_fact'\G
*************************** 1. row ***************************
           Name: ontime_fact
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 6697533
 Avg_row_length: 241
    Data_length: 1616904192
Max_data_length: 0
   Index_length: 539279360
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2011-05-10 04:26:14
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: Contains all available data from 1988 to 2010
1 row in set (0.00 sec)
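Note the DepDelay key on the fact table: it makes selective delay queries cheap on the InnoDB side, and any single-table aggregate is pushed down to all shards in parallel. For example (my own illustrative query, not one of the benchmark queries; the run_query script is introduced in the test section below):

# Count heavily delayed flights across all 20 shards
echo "select count(*) from ontime_fact where DepDelay > 60;" | ./run_query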
With ICE, after compression there is only 2.5GB of data, so ICE gets better than a 16:1 compression ratio compared to InnoDB (and about 8:1 compared to the raw input data), which is quite nice. Each shard contains only about 125MB of data!
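The arithmetic, for the record (the ~21GB input size follows from InnoDB being roughly twice the size of the input data):

# Compression ratios implied by the du output above
echo "scale=1; 42 / 2.5" | bc    # 16.8 -> vs. the 42GB InnoDB directory
echo "scale=1; 21 / 2.5" | bc    # 8.4  -> vs. ~21GB of raw input data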
Storage engine makes a big difference
In general, a column store performs about 8x-10x better than a row store for queries which access a significant amount of data. One big reason for this is the excellent compression that run-length encoding (RLE) techniques provide.
I have not loaded InnoDB compressed tables yet, but since InnoDB compression is not RLE-based, I doubt it will have the same impact.
For large datasets, effective compression means fewer nodes are needed to keep the data entirely in memory. This frees the disks for on-disk temporary storage used by hash joins and other background operations, which has a direct impact on query response times and throughput.
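To see why RLE works so well on column-organized data, note that a sorted column collapses into (count, value) pairs; uniq -c makes the idea concrete:

# Toy illustration of run-length encoding: repeated values in a sorted
# column compress to (count, value) pairs.
printf '%s\n' 2008 2008 2008 2008 2009 2009 2010 | uniq -c
#       4 2008
#       2 2009
#       1 2010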
Setting up a cluster using the AMI images
You can easily test Shard-Query for yourself. Spin up the desired number of EC2 instances using one of the AMI images. For best results, spin up a number of instances that divides evenly into 20. There is a helpful utility (included in the image) to help configure the cluster; it works from a copy of the text on the EC2 console page. To use it, ensure:
- That only the instances that you want to use are shown in the EC2 console.
- That the “private ip” field is selected in the list of columns to show (click Show/Hide to change the columns).
- That the “public dns” field is selected.
SSH to the public DNS entry of the first node in the list. This node will become “shard1”.
Now, in the EC2 console, press CTRL-A to select all text on the page, then CTRL-C to copy it. Paste it into a text file on shard1 called “/tmp/servers.txt” and run the following commands:
$ cat servers.txt | grep "10." | grep -v internal | tee hosts.internal
[host list omitted]
Now you need to set up the hosts file:
$ sudo su -
# cat hosts.internal | ~ec2-user/tools/mkhosts >> /etc/hosts
# ping shard20
PING shard20 (10.126.15.34) 56(84) bytes of data.
64 bytes from shard20 (10.126.15.34): icmp_seq=1 ttl=61 time=0.637 ms
...
Note: There is no need to put that hosts file on your other nodes unless you want to run workers on them.
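If you are curious, the mkhosts tool essentially just numbers the private IPs. A minimal sketch of an equivalent filter (my own approximation, not the actual script shipped in the image) looks like this:

#!/bin/bash
# Approximate mkhosts: read console text on stdin and emit one
# "<private-ip> shardN" hosts entry per line containing a 10.x address.
n=0
while read -r line; do
  ip=$(echo "$line" | grep -oE '10\.[0-9]+\.[0-9]+\.[0-9]+' | head -1)
  [ -z "$ip" ] && continue
  n=$((n + 1))
  echo "$ip shard$n"
done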
Generate a cluster configuration
There is a script provided to generate the shards.ini file for testing a cluster of 1 to 20 nodes.
cd shard-query
# generate a config for 20 shards (adjust to your number of nodes)
php genconfig 20 > shards.ini
Running the test
For best performance, you should run the workers on one or two nodes. You should start two workers per core in the cluster.
First start gearmand:
gearmand -p 7000 -d
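A quick way to verify that the job server is up and listening (assuming netstat is available on the node):

# gearmand should be listening on the port given above
netstat -lnt | grep ':7000'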
Then start the workers on node 1 (assuming a 20 node cluster):
cd shard-query
./start_workers 80
I normally start (2 * TOTAL_CLUSTER_CORES) workers. That is, if you have 20 machines, each with 2 cores, run 80 workers.
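In shell terms, the worker count works out as follows (the node and core counts are yours to adjust):

# 2 workers per core, across the whole cluster: 2 * 20 nodes * 2 cores = 80
NODES=20
CORES_PER_NODE=2
./start_workers $((2 * NODES * CORES_PER_NODE))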
Test the system. You should see the following row count (the first number is wall time, the second exec time, the third parse time).
$ echo "select count(*) from ontime_fact;" | ./run_query Array ( [count(*)] => 135125787 ) 1 rows returned (0.084244966506958s, 0.078309059143066s, 0.0059359073638916s) |
Execute the test:
As seen above, the run_query script will run one or more semicolon-terminated SQL statements. The queries for the benchmark are in ~ec2-user/shard-query/queries.sql.
I have also provided a convenient script, called pivot_results, which summarizes the output of the ./run_query command:
cd shard-query/
$ ./run_query < queries.sql | tee raw | ./pivot_results &
[1] 12359
$ tail -f ./raw
-- Q1
...
At the end, you will get a result output that is easy to graph in a spreadsheet:
$ cat raw | ./pivot_results
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8.0,Q8.1,Q8.2,Q8.3,Q8.4,Q9,Q10,Q11
34.354,60.978,114.175,27.138,45.751,14.905,14.732,34.946,126.599,250.222,529.287,581.295,11.042,63.366,14.573
InnoDB my.cnf
[client]
port=3306
socket=/tmp/mysql-inno.sock

[mysqld]
socket=/tmp/mysql-inno.sock
default-storage-engine=INNODB
innodb-buffer-pool-instances=2
innodb-buffer-pool-size=5600M
innodb-file-format=barracuda
innodb-file-per-table
innodb-flush-log-at-trx-commit=1
innodb-flush-method=O_DIRECT
innodb-ibuf-active-contract=1
innodb-import-table-from-xtrabackup=1
innodb-io-capacity=1000
innodb-log-buffer-size=32M
innodb-log-file-size=128M
innodb-open-files=1000
innodb_fast_checksum
innodb-purge-threads=1
innodb-read-ahead=linear
innodb-read-ahead-threshold=8
innodb-read-io-threads=16
innodb-recovery-stats
innodb-recovery-update-relay-log
innodb-replication-delay=#
innodb-rollback-on-timeout
innodb-rollback-segments=16
innodb-stats-auto-update=0
innodb-stats-on-metadata=0
innodb-stats-sample-pages=256
innodb-stats-update-need-lock=0
innodb-status-file
innodb-strict-mode
innodb-thread-concurrency=0
innodb-thread-concurrency-timer-based
innodb-thread-sleep-delay=0
innodb-use-sys-stats-table
innodb-write-io-threads=4
join-buffer-size=16M
key-buffer-size=64M
local-infile=on
lock-wait-timeout=300
log-error=/var/log/mysqld-innodb.log
max-allowed-packet=1M
net-buffer-length=16K
# we value throughput over response time, get a good plan
optimizer-prune-level=0
partition=ON
port=3306
read-buffer-size=512K
read-rnd-buffer-size=1M
skip-host-cache
skip-name-resolve
sort-buffer-size=512K
sql-mode=STRICT_TRANS_TABLES
symbolic-links
table-definition-cache=16384
table-open-cache=128
thread-cache-size=32
thread-stack=256K
tmp-table-size=64M
transaction-isolation=READ-COMMITTED
user=mysql
wait-timeout=86400
To be continued
You can now set up a cluster from 1 to 20 nodes for testing. This way you can verify the numbers in my next blog post. I will compare performance of various cluster sizes on both storage engines.
Comments

Thank you for taking the time to put together such an extensive writeup on this. I hope to put this to practice some day.

I don’t understand why hashing on date_id is a good idea. I’m assuming date_id was chosen because this is only meant as a demo. In production, wouldn’t this put the majority of the load on one server, since new inserts would have similar date_ids?