Fractal Tree library as a Key-Value store

July 20, 2015

Author

Vadim Tkachenko

MongoDB

Percona Software

Share this Post:

As you may know, Tokutek is now part of Percona and I would like to explain some internals of TokuDB and TokuMX – what performance benefits they bring, along with further optimizations we are working on.

However, before going into deep details, I feel it is needed to explain the fundamentals of Key-Value store, and how Fractal Tree handles it.

Before that, allow me to say that I hear opinions that the “Fractal Tree” name does not reflect an internal structure and looks more like a marketing term than a technical one. I will not go into this discussion and will keep using name “Fractal Tree” just out of the respect to inventors. I think they are in a position to name their invention with any name they want.

So with that said, the Fractal Tree library implements a new data structure for a more efficient handling (with main focus on insertion, but more on this later) of Key-Value store.

You may question how Key-Value is related to SQL Transactional databases – this is more from the NOSQL world. Partially this is true, and Fractal Tree Key-Value library is successfully used in Percona TokuMX (based on MongoDB 2.4) and Percona TokuMXse (storage engine for MongoDB 3.0) products.

But if we look on a Key-Value store in general, actually it maybe a good fit to use in structural databases. To explain this, let’s take a look in Key-Value details.

So what is Key-Value data structure?

We will use a notation (k,v), or key=>val, which basically mean we associate some value “v” with a key “k”. For software developers following analogies may be close:
key-value access is implemented as dictionary in Python, associative array in PHP or map in C++.
(More details in Wikipedia)

I will define key-value structure as a list of pairs (k,v).

It is important to note that both key and value cannot be just scalars (single value), but to be compound.
That is "k1, k2, k3 => v1, v2", which we can read as (give me two values by a 3-part key).

This brings us closer to a database table structure.
If we apply additional requirement that all (k) in list (k,v) must be unique, this will represent
a PRIMARY KEY for a traditional database table.
To understand this better, let’s take a look on following table:
CREATE TABLE metrics ( ts timestamp, device_id int, metric_id int, cnt int, val double, PRIMARY KEY (ts, device_id, metric_id), KEY metric_id (metric_id, ts), KEY device_id (device_id, ts) )

We can state that Key-Value structure (ts, device_id, metric_id => cnt, val), with a requirement
"ts, device_id, metric_id" to be unique, represents PRIMARY KEY for this table, actually this is how InnoDB (and TokuDB for this matter) stores data internally.

Secondary indexes also can be represented in Key=>Value notion, for example, again, how it is used in TokuDB and InnoDB:
(seconday_index_key=>primary_key), where a key for a secondary index points to a primary key (so later we can get values by looking up primary key). Please note that that seconday_index_key may not be unique (unless we add an UNIQUE constraint to a secondary index).

Or if we take again our table, the secondary keys are defined as
(metric_id, ts => ts, device_id, metric_id)
and
(device_id, ts => ts, device_id, metric_id)

It is expected from a Key-Value storage to support basic data manipulation and extraction operations, such as:

(key => value)

(key => value2)

"value"

"key"

(key => value)

"value"

"key"

and I want to add fifth operation:

"key > 5"

"key >= 10 and key < 15"

They way software implements an internal structure of Key-Value store defines the performance of mentioned operations, and especially if datasize of a store grows over a memory capacity.

For the decades, the most popular data structure to represent Key-Value store on disk is B-Tree, and within the reason. I won’t go into B-Tree details (see for example https://en.wikipedia.org/wiki/B-tree), but it provides probably the best possible time for Lookup operations. However it has challenges when it comes to Insert operations.

And this is an area where newcomers to Fractal Tree and LSM-tree (https://en.wikipedia.org/wiki/Log-structured_merge-tree) propose structures which provide a better performance for Insert operations (often at the expense of Lookup/Select operation, which may become slower).

To get familiar with LSM-tree (this is a structure used by RocksDB) I recommend http://www.benstopford.com/2015/02/14/log-structured-merge-trees/. And as for Fractal Tree I am going to cover details in following posts.