During last April’s Percona Live MySQL Conference and Expo, TokuDB celebrated it’s first full-year as an open source storage engine. I still remember reading the official announcement and the expectations it created one year ago. The premises were very interesting as it had the potential of helping MySQL manage “big data” in a way InnoDB just couldn’t. It also provided additional interesting features like “hot schema changes,” all the while making our dear flash storages last longer.
While I’ve kept an eye on the evolution of TokuDB this past year, I reckon I haven’t given it a try…. until recently, when a beta version of Percona Server was released with support for TokuDB as a plugin.
If you haven’t tried TokuDB yet here’s an opportunity to do so. This first post looks into giving a bit of context about how TokuDB fits into MySQL and the next one, to be published in the following days, will document my experience in installing it with Percona Server. I hope you enjoy both and I would appreciate it if you’d take the time to add your comments and questions at the end of the post, so we can learn from each other.
As most of you know well, the heart of MySQL lies in the storage engine. InnoDB has completely changed MySQL, bringing not only transacional capabilities but also stability and new levels of maturity to the whole system. Even those that don’t really need transactions rejoice in the crash resistance strength of InnoDB. But do you remember that not so long ago InnoDB was a third-party proprietary plugin ? At first you would need to compile MySQL with it. Later they made it all easier by allowing the installation and load of the plugin into existing servers. But things really started to flourish when InnoDB went Open Source: it’s adoption increased and slowly it started gaining track and people’s thrust. With the code available for anyone to see, fix and extend, companies started committing their own resources into making InnoDB better, until it became the de facto storage engine of MySQL.
It’s true that today data stored (even compressed) into an InnoDB table takes more disk space compared to a similar MyISAM table but no one would expect there would be no trade-ins in the development of a new technology. At the same time, disk’s capacity were also increasing, which contributed to leverage the rate of bytes/$ and kind of compensated the gourmand needs of InnoDB.
But the increase in disk capacity also pushed the boundaries of what’s worth storing. What was for many the unlimited capacity of Gigabyte disks became, well, a limit, and then Terabyte disks developped into a “must-have”, a true need. At the same time though, with so many interesting things to see and surf, people’s attention became disputed and what was a commodity before transformed into a scarce good. Today, if a website takes more than a handful of seconds to load chances are it may loose the attention of some. SSD disks came to the rescue here, providing access to data in a fraction of the time a regular spindle disk would take. However, they don’t scale as well: the increase in cost for bytes/$ is proportional to the data access speed gains it brought and the longevity (or durability) of SSD disks is not as good, which makes for an expensive investment. It needs to be used wisely. For this reason, it’s becoming common to find a mixed use of fast and expensive SSD drives to store “popular” data and slower and cheaper spindle disks to store all the rest. Of course, this is a short term solution as it’s not that practical to maintain and requires a good amount of manual labor to decide which one stores what. In the long haul, it’s safe to predict SSD-based solutions will prosper as inexpensive storage but until then it is necessary to find a compromise between “big data” and hardware investment.
Another way of tackling this problem is changing the logical part. If one could store more data in the same amount of disk space and yet be able to store and retrieve it as fast, or even faster, then we would possibly get better results (in terms of performance) and a better return for the investment made in storage. That was the approach targeted by Tokutek in the development of the TokuDB storage engine. The core of it’s architecture is based in a different, modern indexing approach, baptized Fractal Tree Indexes (FTI). I say “different” because most popular storage engines such as MyISAM and InnoDB have a B-tree indexing foundation, which remained the somewhat “unchallenged” standard for the past three decades at least. And “modern” because it’s design take into account the write-intensive workloads we see rising more and more in contemporaneous data systems, as well as the “wear and tear” characteristic of the latest storage devices.
Both data structures are tree-based, store data in similar leaf nodes and make use of index key for ordering. But the way they manage and store data across the trees is different. TokuDB and its Fractal Tree structure make use of larger block sizes (bigger leaves) compared to InnoDB’s implementation of B-tree, which allows for better compression (the key for using less disk space) while also improving the performance of range queries. As important, TokuDB claims to make a better use of I/O by employing a message propagation system and an “optimal” buffering mechanism.
While in a traditional B-tree based system a change made in the table would reflect in an update in the index to acomodate it, TokuDB treats each change as a message at first. The interesting bit here is that even before the message reaches the respective leaf and modifies it, the changes it carries is already accounted by the database. It is like the database’s content is composed by the data found in the nodes plus the messages circulating in the tree. This brings agility to the storage engine and play an important role in providing Hot Schema Changes, for instance.
Regarding the optimized I/O buffering system, it is partly inherent to the use of bigger leaves. Or if you prefer, the other way around: the practical use of bigger leaves is made possible because buffers are used in a more efficient way. Efficiency here is measured according to bandwidth use. Remember, an I/O to disk is many times more expensive (in time) than an I/O to memory; that’s why buffers are used – you fill data into the buffer more frequently (less costly) so you can “flush” its content to disk less frequently (more costly). The fullest the buffer is when you flush it to disk, the more efficient use of bandwidth you’re doing. TokuDB’s tries to make the most out of it, “with hundreds or thousands of operations for a single I/O“. The problem of B-trees is that, by design, it’s difficult to implement an efficient buffering system and you tend to flush slightly filled buffers often. For this reason it is better to maintain smaller leafs in B-trees, which has the side effect of allowing for less good compression. Tim Callaghan, head of engineering at Tokutek, explained those differences much better than I can at Percona Live London last November and his slides are available here.
One scenario that benefits from this optimized use of I/O is write-intensive applications. We have recently been using TokuDB with our Percona Cloud Tools (PCT) service to store and analyze slow query logs from MySQL servers. The compression benefits were also a driven reason in the choice of TokuDB as the storage engine for PCT, without which we would be much more limited about the number of organizations we could accomodate in this beta phase of the service. How big is the compression impact ? Like everything else in MySQL it depends of your schema. Shlomi Noach reported he was able to convert 4 Terabytes worth of uncompressed InnoDB data (or 2 Terabytes of compressed InnoDB data using KEY_BLOCK_SIZE=8) down to 200 Gygabytes. It may be that impressive.
Compression alone is a huge attractive characteristic of TokuDB but the storage engine also fits well in scenarios where storage space is not an issue. The optimization in I/O can help lagging replicas where writes (INSERTS) are the limiting factor, and not the network. The “hot schema changes” functionality can be a bless if you need to add a column to a big table, or a secondary index. There’s also the non-less important impact on flash drives durability. Mark Callaghan commented the following in a past post in this blog: “For pure-disk servers the advantages of TokuDB versus InnoDB are limited to a few workloads. For pure-flash servers the advantages of TokuDB are universal — 2X better compression (compared to InnoDB compression on my data) and larger (more sequential) writes means that you will buy much less flash and either it will last longer or you can buy less-expensive flash and it will last long enough“. And let’s not forget Vadim’s favorite feature in TokuDB: following query progress live in SHOW PROCESSLIST.
Tokutek was clever in breaking with tradition and looking at the problem by another angle in the development of TokuDB. It bennefitted from the openess of MySQL and it’s storage engine API to implement a different solution, one that contemplates the reality of today – faster multi-core CPUs, modern but more “fragile” storage devices and a thirst for “big data.” Of course, it benefitted as well from observing how B-tree based storage engines coped with evolving data systems in the last decades and the development of new algorithms to come up with a new approach. And to make some things simpler along the way. It’s easier to tune TokuDB compared to InnoDB: I’ve counted 40 “tokudb_” variables while we find at least 100 more “innodb_“. But it has yet to endure the test of time. Even though we’re not talking about a brand-new storage engine (Vadim reported his first experiences with it 5 years ago) it has recently gone open source and community adoption is still in its initial phase, though steadily growing, as we can see by the number of opened bugs.
One thing that must worry many is the fact there’s no open source hot backup software for TokuDB. Even though there’s a community HotBackup API available on GitHub, which “is a specification for a pluggable backup utility,” the only hot backup working solution available today is bundled in the Enterprise Edition of TokuDB. And since the design of TokuDB doesn’t allow for a backup approach based in copying the database files and then applying the logs containing the changes made in the database during the backup, which is how MySQL Enterprise Backup and Xtrabackup works, there’s no hope to easily extend an existing open source software such as Percona XtraBackup to include TokuDB.
Hopefully we’ll see a new open source backup software implementing the available API in the near future but for now it seems the community is left with filesystem-level snapshoot-based tools such as mylvmbackup and xfs_freeze as the sole alternatives to the proprietary solution.