Since Fractal Tree indexes turn random writes into sequential writes, it’s easy to see why they offer a big advantage for maintaining indexes on rotating disks. It turns out that Fractal Tree indexing also offers significant advantages on SSDs. Here are three ways that Fractal Trees improve your life if you use SSDs.
The results below show the insertion of 1 billion rows into a table while maintaining three multicolumn secondary indexes. At the end of the test, TokuDB’s insertion rate remained at 14,532 inserts/second whereas InnoDB had dropped to 1,607 inserts/second. That’s a difference of over 9x.
|Platform: Centos 5.6; 2x Xeon L5520; 72GB RAM; LSI MegaRaid 9285; 2x 256GB Samsung 830 in RAID0.|
Even on flash, I/O performance costs something. Since TokuDB employs Fractal Tree write-optimized indexes, it performs an order-of-magnitude less I/O for index maintenance. I plan to dive into how Fractal Tree indexes work here sometime soon. I’ll be explaining them again at both MySQL Connect and Percona Live New York, and slides from previous talks are available (for example my talk on “how fractal trees work”, or the tutorial that Michael Bender and I gave at XLDB two weeks ago).
Compression is an always-on feature of TokuDB. Here is a graph showing the effective compression for some web application performance data (log-style data with stored procedure names, database instance names, beginning and ending execution timestamps, durations, row counts, and parameter values). We tested InnoDB compression with two values of key_block_size (4KB and 8KB) and with compression disabled. TokuDB achieved up to 26x compression, far more than InnoDB. In addition, the second chart shows that even with compression, TokuDB performance remains superior, as indicated by the load speeds for each run. Note that web application performance data is very compressible, so this is showing both InnoDB and TokuDB at their best. Our customers typically see average compression factors of around 5x to 10x on their data, and InnoDB users probably see 2x compression or don’t use compression at all.
The difference between InnoDB compression and TokuDB compression can be explained by the fact that TokuDB employs a write-optimized data structure. TokuDB writes large blocks infrequently. TokuDB’s writes are large: roughly speaking, between 2MB and 4MB before compression, whereas InnoDB writes 16KB, 32KB, or 64KB before compression (depending on which compression mode is used). Most compression algorithms, such as gzip and bzip2, don’t get much compression on small blocks — just as the compressor learns enough about the block to start compressing it, the block ends. So TokuDB’s large blocks get better compression. That matters less for this data set than for data with lower compression ratios, because InnoDB is already achieving its maximum compression of 4x on this data set in 4:1 compression mode (after compressing the 64KB block down to something that might be as small as 6KB, InnoDB pads the data and stores it in a 16KB block).
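To see the block-size effect concretely, here is a small sketch (using made-up log-style data, not the benchmark data above) that compresses the same bytes once as a single large block and once as many independent 16KB blocks. Each small block restarts the compressor’s dictionary and pays its own header overhead, so the total comes out larger:

```python
import zlib

# Synthetic log-style rows, similar in spirit to the web-app
# performance data described above (this data is invented).
rows = [
    f"proc_{i % 50},db_{i % 8},{1000 + i},{1000 + i + (i % 90)},{i % 1000}\n"
    for i in range(200_000)
]
data = "".join(rows).encode()

# Compress once as one large block...
large = len(zlib.compress(data, 6))

# ...and again as independent 16KB blocks, summing the results.
small = sum(len(zlib.compress(data[off:off + 16384], 6))
            for off in range(0, len(data), 16384))

print(f"one large block : {large} bytes")
print(f"16KB blocks     : {small} bytes")
```

On data like this, the single large block always comes out smaller than the sum of the 16KB blocks, for exactly the reason described above.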
For the aggressive compression mode, we use LZMA compression. LZMA has the property that it’s about as slow as bzip2 for compression (which is pretty slow), about as fast as gzip for decompression (which is pretty fast), and achieves compression ratios close to bzip2. (If you want to estimate how well your data will compress, try taking a comma-separated data file containing your data and running it through gzip or bzip2. The gzip number will estimate TokuDB standard compression, and the bzip2 number will estimate TokuDB aggressive compression. If you want a more accurate estimate, get and run lzma on your comma-separated data.) A compressor that is expensive for compression and cheap for decompression is a great match for TokuDB, since TokuDB compresses infrequently (since writes are infrequent) and decompresses relatively frequently (whenever data is read from disk).
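The estimation recipe above is easy to script. Here is a sketch using Python’s standard-library compressors (the CSV sample is invented; in practice you would dump a representative extract of your real table):

```python
import bz2
import lzma
import zlib

# Hypothetical comma-separated sample; substitute your own data dump.
lines = [
    f"2012-08-01 10:{i % 60:02d}:{i % 60:02d},proc_{i % 40},db_{i % 6},"
    f"{(i % 500) / 100:.2f},{i % 97}\n"
    for i in range(50_000)
]
csv_sample = "".join(lines).encode()

# zlib approximates gzip (standard compression estimate);
# bz2 and lzma approximate the aggressive-compression estimate.
ratios = {}
for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    ratios[name] = len(csv_sample) / len(compress(csv_sample))
    print(f"{name}: {ratios[name]:.1f}x")
```

The ratios you get on your own dump are only estimates, since TokuDB compresses its own block format rather than raw CSV, but they give a reasonable first-order prediction.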
Compressing the data also provides performance advantages, especially on read operations, since less data needs to be fetched from disk.
B-trees write small blocks, resulting in more writes and increased wear. Fractal Tree indexes write fewer, larger blocks, reducing wear. In this example, from iiBench, TokuDB (average block size 215KB) performs one twenty-fifth as many writes per second as InnoDB (average block size 23KB). But the reduced writes are even better than that looks, since TokuDB finishes 6 times faster. InnoDB performs 59 million writes, whereas TokuDB performs only 368,000 writes (a factor of 161). Geometrically, the area under each curve gives the total number of writes performed by that storage engine. Since TokuDB’s writes are larger, it turns out that InnoDB writes a total of about 17 times as many bytes as TokuDB does.
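These figures can be sanity-checked with a little arithmetic on the rounded numbers quoted above:

```python
# Back-of-the-envelope check of the byte totals quoted above.
innodb_writes, innodb_block_kb = 59_000_000, 23
tokudb_writes, tokudb_block_kb = 368_000, 215

write_count_ratio = innodb_writes / tokudb_writes
byte_ratio = (innodb_writes * innodb_block_kb) / (tokudb_writes * tokudb_block_kb)

print(f"write-count ratio : {write_count_ratio:.0f}x")  # ~160x with these rounded inputs
print(f"bytes-written ratio: {byte_ratio:.0f}x")        # ~17x
```

With these rounded inputs the write-count ratio comes out near 160 rather than exactly 161, but the bytes-written ratio lands right at the factor of 17 discussed below.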
That factor of 17 reduction in the number of bytes written translates to a factor of 17 reduction in wear on the SSD. Here I’ll try to estimate how long the drive would last under this workload.
According to Wikipedia, modern MLC SSDs can be written about 1000 times before wearing out. I have no data on how long the Samsung drive would actually last, however. It might be made of older MLC flash chips that last more than 1000 overwrites. On the other hand, one SSD insider told me that many modern SSDs are shipping even newer MLC that would be lucky to achieve 1000 overwrites. If someone knows something more concrete, please let me know. For now, I’ll assume that you get 1000 overwrites.
How long does it take to overwrite the entire SSD? I measured a sustained throughput of around 10MB/s on random writes on a single Samsung drive with 16KB blocks using sysbench. At that rate, I should be able to overwrite the 256GB drive in about 7 hours, and I might be able to wear it out in about 300 days.
But SSDs wear out even faster than that due to garbage collection in the flash translation layer (FTL). If I fill up my 256GB drive with close to 256GB of data, and if the manufacturer overprovisioned by something like 30 percent, then random writes induce a write amplification factor of around 4. So the drive might really last only about 75 days.
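The arithmetic behind these two estimates, using my measured write rate and the assumed endurance and write-amplification figures:

```python
# Rough drive-lifetime arithmetic from the figures above.
drive_bytes = 256e9   # 256GB drive
write_rate = 10e6     # measured ~10MB/s sustained random writes
overwrites = 1000     # assumed MLC endurance (overwrites per cell)
write_amp = 4         # assumed FTL write amplification on a full drive

one_pass_hours = drive_bytes / write_rate / 3600
raw_days = overwrites * drive_bytes / write_rate / 86400
with_gc_days = raw_days / write_amp

print(f"one full overwrite       : {one_pass_hours:.1f} hours")  # ~7.1 hours
print(f"raw lifetime             : {raw_days:.0f} days")         # ~296 days
print(f"with write amplification : {with_gc_days:.0f} days")     # ~74 days
```

With unrounded inputs this comes out at roughly 296 and 74 days, which the text rounds to 300 and 75.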
I cannot actually write that much data that fast on InnoDB, which achieved a throughput of 1.3 terabytes in 120 hours (about 3MB/s). So InnoDB running iiBench might be able to wear out my drive in about 1000 days (3 years), or about 246 days if you account for garbage-collection-induced write amplification.
With the same workload under TokuDB, the drive would last 17 times as long (about 11 years). Depending on the size of the erase block, TokuDB might do even better. For example, if the effective erase block of the drive is smaller than the size of the written block, then the FTL can achieve lower write amplification. If anyone knows what the effective erase block size is for various SSDs, I’d like to hear about it.
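Putting the lifetime numbers together (all inputs are the rounded figures quoted above):

```python
# Drive-lifetime estimate under the actual iiBench write rates.
drive_bytes = 256e9
overwrites = 1000     # assumed MLC endurance
write_amp = 4         # assumed FTL write amplification

innodb_rate = 1.3e12 / (120 * 3600)   # 1.3TB in 120 hours, ~3MB/s
innodb_days = overwrites * drive_bytes / innodb_rate / 86400
innodb_gc_days = innodb_days / write_amp
tokudb_gc_days = innodb_gc_days * 17  # TokuDB writes ~17x fewer bytes

print(f"InnoDB: {innodb_days:.0f} days raw, {innodb_gc_days:.0f} days with GC")
print(f"TokuDB: {tokudb_gc_days / 365:.0f} years with GC")
```

This reproduces the numbers in the text: roughly 1000 days raw and 246 days with garbage collection for InnoDB, and about 11 years for TokuDB.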
To learn more about TokuDB: