MariaDB ColumnStoreRick Golba
Last month, MariaDB officially released MariaDB ColumnStore, their column store engine for MySQL. This post discusses what it is (and isn’t), why it matters and how you can approach a test of it.
What is ColumnStore?
ColumnStore is a storage engine that turns traditional MySQL storage concepts on their head. Instead of storing the data by row, a column store stores the data by column (obviously). This provides advantages for certain types of data, and certain types of queries run against that data. See my previous post for more details on column-based storage systems.
ColumnStore is a fork of InfiniDB and carries forward many of the concepts behind that product. InfiniDB ceased operations in 2014. With the front end managed through MariaDB, you get access to all of the expected security and audit options of MariaDB. MariaDB designed ColumnStore as a massively parallel database, working best in an environment with multiple servers. This is somewhat different than a traditional row store database.
ColumnStore stores columnar data in a concept called an “extent.” An extent contains a range of values for a single column. Each extent contains no more than 8 million values. It stores additional values in a new extent. The extents for a single column get distributed across the database nodes, known as “Performance Modules” in ColumnStore. It stores each unique extent on more than one node, thus providing data redundancy and removing the need for replication. If a node is down, and it contains an extent needed for a query, that same extent is found on another node in the environment. This data redundancy also provides a high availability environment.
The query engine determines which extents process query requests. Since the data in an extent is often preordered (time series data, for example), many queries can ignore individual extents since they cannot contain any data needed for the query. If we are only looking for data from February 2017, for example, extents containing data outside of that range get ignored. However, if a query requires data from many or all extents on a single column, the query takes much longer to complete.
Unlike some traditional column store vendors, that take an all or nothing approach to storage, MariaDB decided to go with a mixed concept. In a MariaDB MySQL database, you can mix traditional InnoDB storage with the new ColumnStore storage, just like you used to mix InnoDB and MyISAM. This presents some nice options, not the least of which is that it provides a way to “dip your toe” into the world of column stores. On the other hand, it could lead to people using the ColumnStore engine in non-optimal ways. Also, the differences in what is considered optimal architecture between these two storage options make it hard to see how this plays out in the real world.
As discussed in the earlier post, column storage works great for specific types of data and queries. It is important that your data definitions are as tight as possible, and that your queries are appropriate for column-based data.
Many people set their field definition as VARCHAR(256) when setting up a new database. They might not know what type of data gets stored in the new field. This broad definition allows you to store whatever you throw at the database. The negative effect for row store is that it can cause over-allocation of storage – but it only has a minimal effect on queries.
In a column store, the field definition can drive decisions about the compression methods for storing the data, along with sorting implications. Columnar data can use storage more efficiently than a row store, since the data for a single column is well-defined. This leads to selecting the best compression algorithm for the data. If that data is poorly defined, the selected compression algorithm might not be the best for the data.
Sorting is also a problem in a column store when the data types are not well-defined. We’ve all seen integer or date data that is sorted alphabetically. While it can be annoying, we can still adjust to that sorting method to find what we need. Since a column store is often used to perform analytical queries over a range of data, this poorly-sorted data can present a bigger problem. If you specify a column to be VARCHAR and only include date information, that data is sorted alphabetically. The same column defined as DATE causes the data to be sorted by date. This chart shows the difference (date format is mm/dd/yy)
|Alphabetic Sort||Date Sort|
Imagine running a query over a range of dates (requesting all activity in the months of January and February 2017, for example). In the alphabetic sort, this requires working through the whole file, since the data for November shows up between the data for January and February. In the date sort, the query only reads the until the end of February. We know there can be no more matching data after that. The alphabetic sort leads to more I/O, more query time and less happiness on the part of the user.
Why Should You Care About ColumnStore?
The first reason is that it allows you to try out column storage without doing a massive shift in technology and with minimal effort. By setting up some tables in a MariaDB database to use the ColumnStore engine, you can benefit from the storage efficiencies and faster query capabilities, provided that the data you’ve selected for this purpose is sound. This means that the data definitions should be tight (always a good plan anyway), and the queries should be more analytical than transactional. For a purely transactional workflow, a row store is the logical choice. For a purely analytical workflow, a column store is the logical choice. ColumnStore allows you to easily mix the two storage options so that you can have the best match possible. It is still important to know what type of workflow you’re dealing with, and match the storage engine to that need.
Another solid reason is that it is a great fit if you are already doing analysis over massive amounts of data. Any column store shines when asked to look at relatively few columns of data (ideally the column or two that are being aggregated and other columns to locate and group the data). If you are already running these types of queries in MySQL, ColumnStore would likely be a good fit.
But There Be Dragons!
As with any new technology, ColumnStore might not be a good fit for everyone. Given that you can mix and match your storage engines, with ColumnStore for some tables and InnoDB for others, it can be tempting to just go ahead with a ColumnStore test doing things the same way you always did in the past. While this still yields results, those results might not be a true test of the technology. It’s like trying to drive your minivan the same way you used to drive your sports car. “Hey, my Alfa Romeo never flipped over taking these turns at high speed!”
To effectively use ColumnStore, it’s important to match it to a proper analytical workload. This means that you will likely do more bulk loading into these tables, since there is additional overhead in writing the data out into the column files. The overall workflow should be more read-intensive. The queries should only look for data from a small set of fields, but can span massive amounts of data within a single column. In my earlier post, there’s also a discussion about normalization of data and how denormalizing data is more common in columnar databases.
You should address these issues in your testing for a valid conclusion.
The minimum specifications for ColumnStore also point to a need for a more advanced infrastructure than is often seen for transactional data. This is to support batch loading, read intensive workloads and possibly different ETL processes for each type of data. In fact, MariaDB states in the installation documentation for ColumnStore that it must be completed as a new installation of MariaDB. You must remove any existing installations of MariaDB or MySQL before installing the ColumnStore-enabled RPM on a system.
Is It Right for Me?
ColumnStore might fit well into your organization. But like haggis, it’s not for everyone. If you need analytical queries, it is a great option. If your workload is more read-intensive, it could still work for you. As we move to a more Internet of Things (IoT) world, we’re likely to see a need for more of this type of query work. In order to accurately present each user with the best possible Internet experience, we might want to analyze their activities over spans of time and come up with the best match for future needs.
Seriously consider if making the move to ColumnStore is right for you. It is newer software (version 1.0.6, the first GA version, was released on December 14, 2016, and 1.0.7 was released on January 23, 2017), so it might go through changes as it matures. Though a new product, it is based on InfiniDB concepts (which are somewhat dated). MariaDB has lots of plans for additional integrations and support for ancillary products that are absent in the current release.
MariaDB took a huge step forward with ColumnStore. But do yourself a favor and consider whether it is right for you before testing it out. Also, make sure that you are not trying to force your current workflow into the column store box. Kids know that we cannot put a square peg in a round hole, but we adults often try to do just that. Finding the right peg saves you lots of time, hassle and annoyance.