‘Indexing’ JSON documents for efficient MySQL queries over JSON data

MySQL meets NoSQL with JSON UDF

I recently got back from FOSDEM, in Brussels, Belgium. While I was there I got to see a great talk by Sveta Smirnova about her MySQL 5.7 Labs release JSON UDF functions. It is important to note that while the UDFs come in a 5.7 release, it is absolutely possible to compile and use them with earlier versions of MySQL, because the UDF interface has not changed for a long time. However, the UDFs should still be considered alpha/preview quality and should not be used in production yet! For this example I am using Percona Server 5.6 with the UDFs.

That being said, the proof of concept that I’m about to present uses only one JSON function (JSON_EXTRACT), and it has worked well enough in my testing to share the idea here. The JSON functions will probably be GA sometime soon anyway, and this is a useful test of the JSON_EXTRACT function.

The UDFs let you parse, search and manipulate JSON data inside of MySQL, bringing MySQL closer to the capabilities of a document store.

Since I am using Percona Server 5.6, I needed to compile and install the UDFs. Here are the steps I took to compile and install the plugin:

  1. $ cd mysql-json-udfs-0.3.3-labs-json-udfs-src
  2. $ cmake -DMYSQL_DIR=/usr/local/mysql .
  3. $ sudo make install
  4. $ sudo cp *.so /usr/local/mysql/lib/plugin

JSON UDFs are great, but what’s the problem?

The JSON functions work very well for manipulating individual JSON objects, but, like all other functions, using JSON_EXTRACT in the WHERE clause will result in a full table scan. This makes the functions virtually useless for searching through large volumes of JSON data. If you want to use MySQL as a document store, this severely limits its usefulness: the ability to extract key/value pairs from JSON documents is powerful, but without indexing it cannot scale well.

What can be done to index JSON in MySQL for efficient access?

The JSON UDFs provide a JSON_EXTRACT function which can pull data out of a JSON document. There are two ways we can use this function to “index” the JSON data.

  1. Add extra columns to the table (or use a separate table, or tables) containing the JSON and populate the columns using JSON_EXTRACT in a trigger. The downside is that this slows down inserts and modifications of the documents significantly.
  2. Use Flexviews materialized views to maintain an index table separately and asynchronously. The upside is that insertion/modification speed is not affected, but there is a slight delay before the index is populated. This is similar to eventual consistency in a document store.

Writing triggers is an exercise I’ll leave up to the user. The rest of this post will discuss using Flexviews materialized views to create a JSON index.

What is Flexviews?

Flexviews can create ‘incrementally refreshable’ materialized views. This means that the views can be refreshed efficiently using changes captured by FlexCDC, the change data capture tool that ships with Flexviews. Since the view can be refreshed quickly, it is possible to refresh it frequently and maintain a low-latency index, though not one that is perfectly in sync with the base table at all times.

The materialized view is a real table that is indexed to provide fast access. Flexviews includes a SQL_API, or a set of stored procedures for defining and maintaining materialized views.

See this set of slides for an overview of Flexviews: http://www.slideshare.net/MySQLGeek/flexviews-materialized-views-for-my-sql

Demo/POC using materialized view as an index

The first step to creating an incrementally refreshable materialized view with Flexviews is to create a materialized view change log on all of the tables used in the view. The CREATE_MVLOG($schema, $table) function creates the log, and FlexCDC will immediately begin to collect changes into it.
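
A minimal sketch of the call, assuming a hypothetical base table test.json_docs with an integer primary key id and a doc column holding the JSON text (the table and column names here are illustrative, not the ones from my actual test):

  -- register a change log on the base table; FlexCDC begins capturing its changes immediately
  CALL flexviews.create_mvlog('test', 'json_docs');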

Next, the materialized view name and refresh type must be registered with the CREATE($schema, $mvname, $refreshtype) function:
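
For example, registering an incrementally refreshable view named json_idx in the test schema, and saving its id for the calls that follow:

  CALL flexviews.create('test', 'json_idx', 'INCREMENTAL');
  -- fetch the id Flexviews assigned to the new view
  SET @mvid := flexviews.get_id('test', 'json_idx');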

Now one or more tables have to be added to the view using the ADD_TABLE($mvid, $schema, $table, $alias, $joinclause) function. This example will use only one table, but Flexviews supports joins too.
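
Roughly like this, with the hypothetical test.json_docs table from above; the join clause is NULL because this is the first (and only) table in the view:

  CALL flexviews.add_table(@mvid, 'test', 'json_docs', 'json_docs', NULL);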

Expressions must be added to the view next. Since aggregation is not used in this example, the expressions should be ‘COLUMN’ type expressions. The function ADD_EXPR($mvid, $expression_type, $expression, $alias) is used to add expressions. Note that JSON_EXTRACT returns a TEXT column, so I’ve CAST the function to integer so that it can be indexed. Flexviews does not currently have a way to define prefix indexes.

I’ve also projected out the ‘id’ column from the table, which is the primary key. This ties the index entries to the original row, so that the original document can be retrieved.
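
A sketch of the expression definitions. The attribute names (price, seller_id) are illustrative, and note that the labs version of JSON_EXTRACT takes key names as separate arguments rather than a JSON path expression:

  -- project the primary key so index entries can be joined back to the original document
  CALL flexviews.add_expr(@mvid, 'COLUMN', 'id', 'id');
  -- JSON_EXTRACT returns TEXT, so CAST the extracted values to integers for indexing
  CALL flexviews.add_expr(@mvid, 'COLUMN', "CAST(json_extract(doc, 'price') AS UNSIGNED)", 'price');
  CALL flexviews.add_expr(@mvid, 'COLUMN', "CAST(json_extract(doc, 'seller_id') AS UNSIGNED)", 'seller_id');
  -- ...and so on for each attribute that should be searchable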

Since we want to use the materialized view as an index, we need to index the columns we’ve added to it.
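
One way to do this, assuming your Flexviews version accepts 'KEY' expressions (the expression names the column to index and the alias becomes the index name); if it does not, an ALTER TABLE ... ADD INDEX on the materialized view after it has been enabled achieves the same result:

  CALL flexviews.add_expr(@mvid, 'KEY', 'price', 'price_key');
  CALL flexviews.add_expr(@mvid, 'KEY', 'seller_id', 'seller_id_key');
  -- the original test builds six such indexes in total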

Finally, the view has to be created. There are 6 million rows in my table, the JSON functions are UDFs so they are not as fast as built-in functions, and I indexed a lot of things (six different indexes are being populated at once), so it takes some time to build the index:
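
Materializing the view is a single call:

  -- creates and populates the json_idx table; with six million documents and six indexes this takes a while
  CALL flexviews.enable(@mvid);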

After the materialized view is built, you can see it in the schema. Note there is also a delta table, which I will explain a bit later.
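
Something like the following, assuming the usual Flexviews naming where the delta table carries the view name plus a suffix:

  SHOW TABLES IN test LIKE 'json_idx%';
  -- json_idx        (the materialized view itself)
  -- json_idx_delta  (the delta table)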

Here is the table definition of json_idx, our materialized view. You can see it is indexed:
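
The definition can be inspected in the usual way; the extracted columns and the KEY entries on them are what make the lookups below fast:

  SHOW CREATE TABLE test.json_idx\G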

Here are some sample contents. You can see the integer values extracted out of the JSON:
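
For example (the column names follow the aliases chosen when the expressions were added):

  SELECT id, price, seller_id FROM test.json_idx LIMIT 5;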

Now there needs to be an easy way to use this index in a SELECT statement. Since a JOIN is needed between the materialized view and the base table, a regular VIEW makes sense for accessing the data. We’ll call this the index view:
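
A sketch of such a view, joining the materialized view back to the base table on the projected primary key (the view name and columns are the illustrative ones used above):

  CREATE VIEW test.json_idx_view AS
  SELECT id,
         json_idx.price,
         json_idx.seller_id,
         json_docs.doc
    FROM test.json_idx
    JOIN test.json_docs USING (id);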

And just for completeness, here are the contents of a row from our new index view:
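
For example:

  SELECT * FROM test.json_idx_view LIMIT 1\G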

Using the UDF to find a document

A query that filters with the UDF does a full table scan, parsing all six million documents (TWICE!) as it goes along. Unsurprisingly, this is slow:
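
An illustrative query of this shape; JSON_EXTRACT appears in both the select list and the WHERE clause, so every document is parsed twice:

  SELECT json_extract(doc, 'price') AS price
    FROM test.json_docs
   WHERE CAST(json_extract(doc, 'seller_id') AS UNSIGNED) = 42;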

Using the index view to find a document
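
The same lookup through the index view uses the index on the materialized view and returns almost immediately (again with the illustrative names from above):

  SELECT price, doc
    FROM test.json_idx_view
   WHERE seller_id = 42;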

Keeping the index in sync

Flexviews materialized views need to be refreshed when the underlying table changes. Flexviews includes a REFRESH($mvid, $mode, $transaction_id) function.
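
The usual invocation refreshes the view in one step; passing NULL for the transaction id refreshes up to the newest changes FlexCDC has captured:

  CALL flexviews.refresh(@mvid, 'BOTH', NULL);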

I am going to remove one document from the table:
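
For example, deleting the document whose primary key is 10000 from the hypothetical base table:

  DELETE FROM test.json_docs WHERE id = 10000;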

Note there is now one row in the materialized view change log. dml_type is -1 because it is a delete:
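
The log can be inspected directly; exactly where it lives and what it is called depends on the FlexCDC configuration (by default a log table named after the base table in a dedicated log schema), so treat this as a sketch:

  -- dml_type is 1 for an insert and -1 for a delete
  SELECT dml_type, id FROM flexviews.test_json_docs;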

Now we can verify the materialized view is out of date:
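
A simple row count shows the index lagging behind its base table:

  SELECT (SELECT COUNT(*) FROM test.json_docs) AS base_rows,
         (SELECT COUNT(*) FROM test.json_idx)  AS index_rows;
  -- index_rows is now one higher than base_rows, because the delete has not been applied yet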

To bring the index up to date we must refresh it. Usually you will use the ‘BOTH’ mode to ‘COMPUTE’ and ‘APPLY’ the changes at the same time, but I am going to use COMPUTE mode to show you what ends up in the delta table:
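
Computing the pending changes without applying them:

  CALL flexviews.refresh(@mvid, 'COMPUTE', NULL);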

Delta tables are similar to materialized view change log tables, except they contain insertions and deletions to the view contents. In this case you can see dml_type is -1 and id = 10000, so the row in the view corresponding to the row we deleted will be removed when the change is applied.
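
Assuming the conventional delta table naming, the pending change looks like this:

  SELECT dml_type, id FROM test.json_idx_delta;
  -- returns dml_type = -1, id = 10000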

Finally, the change can be applied:
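
Using the same refresh function in APPLY mode:

  CALL flexviews.refresh(@mvid, 'APPLY', NULL);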

Lastly, it makes sense to keep the index in sync as quickly as possible using a MySQL event:
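
A sketch of such an event; the event name and the one-second interval are illustrative, and the event scheduler must be enabled for it to run:

  SET GLOBAL event_scheduler = ON;

  CREATE EVENT test.refresh_json_idx
    ON SCHEDULE EVERY 1 SECOND
  DO
    CALL flexviews.refresh(flexviews.get_id('test', 'json_idx'), 'BOTH', NULL);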

So there you have it. A way to index and quickly search through JSON documents and keep the index in sync automatically.

Comments (4)

  • Jaime Crespo

    As an alternative to triggers, MySQL 5.7 will have computed/virtual columns, ideal for an implementation of indexed functions: http://www.slideshare.net/jynus/query-optimization-with-mysql-57-and-mariadb-10-even-newer-tricks/24

    February 17, 2015 at 12:05 pm
  • Peter Laursen

    MariaDB has had computed/virtual columns since version 5.2. I think it would be preferable (for integrity, not performance, reasons) to have the extracted content used for indexing in the same table row/tuple/record as the JSON BLOB.

    But a TRIGGER in MySQL/MariaDB cannot write to the table ON which it is defined. Correct me if I am wrong.

    February 18, 2015 at 8:14 am
  • Justin Swanhart

    As I recall, UDFs (and stored procs) can’t be used in computed/virtual columns, but perhaps that has been rectified.

    As for triggers, they can modify the row that is being inserted (as long as it is a BEFORE trigger), but they can’t access the table itself. So you could create table Z (a int, b int, c varchar(100) default NULL);

    insert into Z (a,b) values (1,2);

    And in BEFORE INSERT trigger say:
    IF new.c IS NULL THEN SET new.c := 'DIFFERENT DEFAULT VALUE'; END IF;

    then select * from Z would yield:
    1,2,'DIFFERENT DEFAULT VALUE'

    February 18, 2015 at 9:46 am
  • Fadi El-Eter (itoctopus)

    Hi Justin,

    What do you think about using Sphinx to search for json data efficiently?

    Also, what kind of performance overhead will you have using your way on a high traffic/high write database?

    PS: Thanks for sharing this excellent post!

    February 19, 2015 at 11:20 am
