Mo’ Data, Mo’ Problems
Jon Tobin
Welcome to blog #2 in a series about the benefits of the Fractal Tree. In this post, I’ll be explaining Big Data, why it poses such a problem and how Tokutek can help. Given the fact that I am a lifelong fan of both Hip-hop and Big Data, the title was a no-brainer and, given the artist, a bit of a pun.
I am as tired as you of hearing the term “Big Data.” It’s so overused that it has ceased to have any specific meaning. You see, data hardly ever starts as “big” or a “problem.” Rather, it starts small and easily manageable, but gradually grows to some unimaginable size and becomes a beast in need of slaying, like the irradiated ant from a sci-fi film, growing to the size of a cruise ship. The nature of tackling such a tough problem means that the initial understanding of the factors involved is, oftentimes, incomplete at best; Catch-22 exemplified. During the course of the project, it is common to realize that the number of data sources that you’ll need to ingest will increase, which means more data to store. With more data sources, what you had originally envisioned as a “data drip” has suddenly become a “data water main.” Don’t even get me started on the difficulties of crunching large and quickly changing datasets. Let me give you a quick example of how these projects evolve:
Imagine that you want to help people sleep better by collecting and analyzing data on sleep quality. You start by developing a device that captures heart rate and motion at 2-minute intervals during the sleep period. Quickly, you realize that you can only tell people how they’ve historically slept using that data, so you decide to collect heart rate at a shorter interval and you gather galvanic skin response (sweat). You quickly realize that you still can’t give suggestions for better sleep, so you add the ability to track food intake and also monitor brain activity. This situation, while contrived, is an illustration of how these projects evolve as understanding of the true problem and the factors involved changes over time. It’s the absence of understanding that makes the project worth undertaking, but also makes the boundaries undefined.
The common approach to dealing with the increased volume and velocity of Big Data is to scale out or “shard”: dividing the workload up to increase the overall pool of resources. However, environments go from relatively small to monstrously large very quickly, which also makes management exponentially harder. Oftentimes, this strategy is used to get around a single bottleneck (usually memory), leaving the additional server resources (CPU, disk, etc.) entombed, never to provide a return on investment. From the business perspective, CapEx (infrastructure) and OpEx (management, power, etc.) grow at an unpalatable rate, diminishing the return on any benefits the project can possibly provide; again, Catch-22.
As I explained in the last post, the Fractal Tree was invented to bring efficiency to these types of environments. The biggest feature of our implementations is the fact that we decided to integrate the Fractal Tree into two of the biggest open source databases, instead of creating our own database. This means you DO NOT need to re-platform your application to get increased efficiency. By “efficiency,” I mean the Fractal Tree gives you the ability to get more “work” from each unit of hardware that you put in. We’ve proven the ability of the Fractal Tree to give up to 20x performance improvement over traditional (B-tree) indexing strategies while reducing the data storage footprint via compression.
For the sake of argument, though, let’s say that we could only increase the performance for every unit of hardware by 2x. Would reducing your infrastructure, power and, possibly, management costs by ½ just by switching your indexing technology (no application code change) be attractive to you?
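To put rough numbers on that question, here is a back-of-the-envelope sketch. The workload and per-server throughput figures are hypothetical and chosen only for illustration; they are not measurements, and `servers_needed` is just a helper name I made up for this example.

```python
import math


def servers_needed(required_ops_per_sec: float, ops_per_server: float) -> int:
    """Smallest whole number of servers that can carry the workload."""
    return math.ceil(required_ops_per_sec / ops_per_server)


# Hypothetical numbers, for illustration only.
workload = 100_000           # operations per second the application must sustain
baseline_per_server = 5_000  # operations per second one server handles today

before = servers_needed(workload, baseline_per_server)
after = servers_needed(workload, baseline_per_server * 2)  # 2x work per unit of hardware

print(f"servers before: {before}, servers after: {after}")
```

Under these assumptions, doubling the work each server can do halves the number of servers, and with it whatever costs scale with server count.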
Let’s examine the properties of the Fractal Tree that make Big Data more efficient (click on the heading to see proof):
Two of the 3 “Vs” of Big Data are Volume and Velocity, more simply, how much and how fast. If you look at the blue line in the chart below, you can see what happens to the B-tree once the dataset becomes larger than memory (around 25M rows). Without going into painful detail, the Fractal Tree keeps insert performance running predictably, at memory speeds, by managing memory with the data structure itself; visualize a “champagne waterfall” (more in the next paragraph). This makes your database more “storage performance” efficient as well. What I mean is that the Fractal Tree actually amortizes the cost of one disk operation over many database operations, which is the opposite of the B-tree. Whereas, previously, your database performance was bottlenecked by how fast your disks could store information, now, your database can continually ingest as fast as your memory can accept. Furthermore, performance will be very predictable, helping you to estimate when you’ll need to scale out.
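The amortization idea can be illustrated with a deliberately crude cost model. This is a toy sketch, not the Fractal Tree algorithm: it assumes that once the dataset is larger than memory, every unbuffered insert costs one disk operation, while a buffered design pays one disk operation per full buffer. The function names and the buffer size are hypothetical.

```python
def disk_ops_unbuffered(num_inserts: int) -> int:
    """Toy assumption: each insert lands on a page that is not in memory,
    so every insert costs one disk operation."""
    return num_inserts


def disk_ops_buffered(num_inserts: int, buffer_size: int) -> int:
    """One disk operation per full buffer, plus one for a final partial buffer."""
    full_buffers, remainder = divmod(num_inserts, buffer_size)
    return full_buffers + (1 if remainder else 0)


inserts = 1_000_000
buffer_size = 1_000  # hypothetical number of operations batched per disk write

print("unbuffered disk ops:", disk_ops_unbuffered(inserts))
print("buffered disk ops:  ", disk_ops_buffered(inserts, buffer_size))
```

In this model the cost of a single disk operation is shared by every operation in the batch, which is the sense of "amortize" used above. The real data structure is more involved, but the direction of the effect is the same.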
Database compression has been available for many years, but it has always come at the cost of performance: performance OR compression. The same “amortization” property that I described above enables the Fractal Tree to give you performance AND compression. If you visualize the champagne waterfall that I previously described, it will help to understand this. Consider the database operations to be the champagne coming out of the bottle and the glass pyramid to be server memory. The database operations go into the top of the memory and over time are “pushed” down to the bottom of the memory. As soon as the database operations are poured into the top glass, the Fractal Tree tells the database to move on. Over time, we’ll have a bunch of data sitting in the bottom glasses; we simply compress that data and write it to disk in large chunks. Please keep in mind this is a very simplified explanation, but it gives a basic understanding of how the Fractal Tree keeps performance consistently fast.
I’ve already covered our increased insert performance, but we can significantly increase query performance as well. This has to do with our ability to index information quickly. An index is a way to dramatically increase the speed of your queries. Think of looking through a phone book (if you can remember those days). It’s easy to do because you know that it’s ordered alphabetically. The alphabetical ordering is just like an index. Now, imagine that I scrambled every name on every page in the phone book so that it’s completely random. In order to find a specific phone number, you’ll need to scan through every name on every page in the book. This would be effective, but extremely slow and inefficient.
A secondary index is a way of reordering information so that it’s easy to find by some other attribute. Every additional index uses the same indexing method, meaning every other index you define is a Fractal Tree (champagne waterfall). Since the Fractal Tree provides great insert speed, you can define more of them without affecting insert performance, and thus increase query efficiency. With B-trees, extra indexes mean the work multiplies and the database gets slower; query speed OR insert speed. This leads me into our next feature…
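Here is a minimal sketch of the "pour into the top glass, write the bottom glasses in bulk" idea. It is a single-level buffer, not a Fractal Tree, and everything in it (the `BufferedWriter` class, the 64 KB threshold, the sample records, an in-memory `BytesIO` standing in for the disk) is a hypothetical stand-in chosen for illustration.

```python
import io
import zlib


class BufferedWriter:
    """Accepts records into memory and writes them out compressed, in large chunks."""

    def __init__(self, sink, flush_threshold: int):
        self.sink = sink                      # file-like object standing in for the disk
        self.flush_threshold = flush_threshold
        self.pending = []                     # records waiting in memory
        self.pending_bytes = 0
        self.disk_writes = 0

    def insert(self, record: bytes) -> None:
        # The caller returns as soon as the record is buffered in memory.
        self.pending.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        # Compress the whole batch at once and write it as one large chunk.
        chunk = zlib.compress(b"\n".join(self.pending))
        self.sink.write(chunk)
        self.disk_writes += 1
        self.pending.clear()
        self.pending_bytes = 0


disk = io.BytesIO()
writer = BufferedWriter(disk, flush_threshold=64 * 1024)
raw_bytes = 0
for i in range(10_000):
    record = f"user={i % 50},heart_rate=60".encode()
    raw_bytes += len(record)
    writer.insert(record)
writer.flush()  # write whatever is left in memory

print("inserts:", 10_000, "disk writes:", writer.disk_writes)
print("raw bytes:", raw_bytes, "bytes on disk:", len(disk.getvalue()))
```

Because compression runs once per large batch rather than once per operation, its cost is spread across many inserts, which is why batching and compression fit together so naturally.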
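The phone book analogy translates almost directly into code. In this sketch a sorted list plus binary search (Python's standard `bisect` module) stands in for an index; it is not how a database stores its indexes, and the names and numbers are made up.

```python
import bisect

# A "scrambled" phone book: entries in no particular order (made-up data).
phone_book = [
    ("Nguyen", "555-0101"),
    ("Alvarez", "555-0102"),
    ("Smith", "555-0103"),
    ("Chen", "555-0104"),
]


def find_by_scan(records, name):
    """Without an index: look at every entry until the name turns up."""
    for entry_name, number in records:
        if entry_name == name:
            return number
    return None


# The "index": the same entries kept in alphabetical order.
sorted_book = sorted(phone_book)
sorted_names = [name for name, _ in sorted_book]


def find_by_index(name):
    """With an index: binary search jumps to the right spot in a few steps."""
    pos = bisect.bisect_left(sorted_names, name)
    if pos < len(sorted_names) and sorted_names[pos] == name:
        return sorted_book[pos][1]
    return None


print(find_by_scan(phone_book, "Smith"))
print(find_by_index("Smith"))
```

Both functions return the same answer. The difference is that the scan's work grows with the size of the book, while the sorted lookup needs only a handful of comparisons even for millions of entries.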
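To see why extra indexes multiply the write work, here is a small sketch of a table that maintains secondary indexes on insert. Plain dictionaries stand in for the index structures (neither B-trees nor Fractal Trees), and the `Table` class, column names, and rows are hypothetical.

```python
from collections import defaultdict


class Table:
    """Toy table with a primary key and any number of secondary indexes."""

    def __init__(self, secondary_columns):
        self.rows = {}  # primary key -> row
        self.secondary = {col: defaultdict(list) for col in secondary_columns}
        self.index_updates = 0  # how many index structures were touched in total

    def insert(self, pk, row):
        self.rows[pk] = row
        self.index_updates += 1  # the primary index
        for col, index in self.secondary.items():
            index[row[col]].append(pk)  # every secondary index must be updated too
            self.index_updates += 1

    def find_by(self, col, value):
        """Answer a query on a secondary column without scanning every row."""
        return [self.rows[pk] for pk in self.secondary[col].get(value, [])]


table = Table(secondary_columns=["city", "last_name"])
table.insert(1, {"last_name": "Smith", "city": "Boston"})
table.insert(2, {"last_name": "Chen", "city": "Austin"})
table.insert(3, {"last_name": "Alvarez", "city": "Boston"})

print("index updates for 3 inserts:", table.index_updates)
print(table.find_by("city", "Boston"))
```

Each insert touches one structure per index, so the cheaper a single index update is, the more indexes you can afford to keep.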
Replication is used in an overwhelming majority of databases because most data that’s worth storing is worth protecting. If a single server catches on fire, you don’t want the only copy of the database to go up with it. Thus, databases send their data to other databases on a separate server, which gives you another copy of the data and a server ready to take over should the “primary” fail. The problem with this is that it’s a CapEx multiplier, and your valuable resources are providing no return to the business unless something catastrophic has happened. Furthermore, with traditional replication the secondary server is forced to “replay” the exact workload of the primary. This is very similar to the process of hand copying a never-ending book prior to the invention of the Gutenberg printing press. In both of our products, we’ve modified the way the secondary servers handle incoming data. This means that your secondaries will now have a majority of their resources available to service queries, or what is termed “read scale.” Now, you can use your secondaries to help ask questions of the data you’re ingesting without affecting their ability to take over should something unexpected happen to the primary, or you can use less powerful (and less expensive) servers for secondaries.
Hopefully, you can see how we can drastically increase the efficiency of your MySQL and/or MongoDB environment.
Like what you see? Don’t believe me? Sign up for the Tokutek Challenge and let us put our money where our mouth is!