Advanced JSON indexing in MySQL

This post will discuss some effective methods for getting JSON indexing to work smoothly in MySQL.

What is JSON?

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing document-store data, where each document may have a different set of attributes (schema). Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to interoperate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with in MySQL. Up until recently, a JSON document was just a TEXT document to MySQL. I said “up until recently”, so what has changed? The biggest thing is the new JSON UDFs by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDFs are at version 0.0.4. While these new UDFs are a welcome addition to the MySQL database, they don’t solve the really tough JSON problems we face.

Searching

The JSON UDFs provide a number of functions that make working with JSON easier, including the ability to extract portions of a document or search a document for a particular key. That being said, you can’t use JSON_EXTRACT() or JSON_SEARCH() in the WHERE clause, because doing so initiates a dreaded full table scan (what MongoDB would call a full collection scan). This is a big problem, and the common wisdom is that JSON can’t be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

Actually, however, I’ve come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is. Now, when you think index you think B-TREE or HASH indexes (or bitmap indexes), but MySQL also supports FULLTEXT indexes.

A FULLTEXT index is an inverted index in which words (tokens) point to documents. While fulltext indexes are great, they aren’t normally usable for JSON. The reason is that MySQL splits words on whitespace and non-alphanumeric characters, so a JSON document doesn’t end up being usable when the name of the field (the key) can’t be associated with the value. But what if we transform the JSON? You can “flatten” the JSON down into key/value pairs and use a fulltext index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ RapidJSON library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document (a hypothetical nested order; any nested document will do):
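    {
      "id": 1,
      "customer": {
        "name": "Sakila",
        "address": {"city": "Durham", "state": "CA"}
      },
      "items": [
        {"sku": "A100", "invoice_price": 9.99},
        {"sku": "B200", "invoice_price": 19.99}
      ]
    }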

Flattened, it becomes a set of key=value tokens (the exact output format of RAPID_FLATTEN_JSON may differ slightly; this shows the idea):
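    id=1
    name=Sakila
    city=Durham
    state=CA
    sku=A100
    invoice_price=9.99
    sku=B200
    invoice_price=19.99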

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another field (or another table) and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equals sign. We don’t want this, as it destroys the locality between the key and the value. To fix this problem you have to take the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added the equals sign (=) to the list of lower-case characters, as described in the manual. You only have to change two text files; there is no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.
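A rough sketch of the two edits, following the manual’s “Adding a User-Defined Collation for Full-Text Indexing” procedure (the collation id just has to be one that is unused in your Index.xml, and you should check the manual for the exact byte values in the ctype map):

    <!-- share/charsets/Index.xml: declare the new collation
         inside the latin1 <charset> element, with an unused id -->
    <collation name="ft_kvpair_ci" id="62"/>

    <!-- share/charsets/latin1.xml: add a matching <collation> element
         whose sort-order <map> is copied from latin1_swedish_ci, and
         change the <ctype> map entry for 0x3D ('=') from punctuation
         to lower-case letter -->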

By the way, I used a UDF because MySQL FULLTEXT indexes don’t support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin, and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex JSON:
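For concreteness, assume a table shaped something like this (the table and column names are hypothetical):

    -- hypothetical table holding one JSON document per row
    CREATE TABLE t1 (
      id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      doc MEDIUMTEXT NOT NULL  -- the raw JSON document
    ) ENGINE=InnoDB;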

Add a column for the index data and FULLTEXT index it:
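Something like the following, continuing the hypothetical t1 table; the new column must use the ft_kvpair_ci collation described above so the key=value pairs survive tokenization (InnoDB FULLTEXT indexes require MySQL 5.6 or later):

    -- flat holds the output of RAPID_FLATTEN_JSON for indexing
    ALTER TABLE t1
      ADD COLUMN flat MEDIUMTEXT CHARACTER SET latin1 COLLATE ft_kvpair_ci,
      ADD FULLTEXT INDEX ft_flat (flat);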

Then populate the index. Note that you can create a trigger to keep the second column in sync; I leave that as an exercise for the reader. Or you can use Flexviews to maintain a copy in a second table automatically.
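A one-time backfill could look like this:

    -- flatten every document into its indexable key=value form
    UPDATE t1 SET flat = RAPID_FLATTEN_JSON(doc);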

Using the index:
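With the hypothetical documents above, a boolean-mode search for orders shipped to California might look like this. Because ‘=’ is a letter under ft_kvpair_ci, state=CA is a single token, and the fulltext index satisfies the predicate without a full table scan:

    -- find documents containing the key/value pair state=CA
    SELECT id, doc
      FROM t1
     WHERE MATCH(flat) AGAINST ('+state=CA' IN BOOLEAN MODE);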

The documents I searched in that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to index only a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and flatten only that.
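For instance, to index only the customer sub-document of the hypothetical example (json_extract here is the Labs UDF, which takes the key path as additional arguments):

    -- flatten (and thus index) only the customer portion of each document
    UPDATE t1 SET flat = RAPID_FLATTEN_JSON(json_extract(doc, 'customer'));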

Aggregating

JSON documents may contain sub-documents, as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but the result is still a text document. There is no function that can extract ALL of the values for a particular key (like invoice_price) and aggregate the results. So, if you have a collection of order documents, each containing a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDFs to aggregate a “total sales” figure across all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, ‘key’). This UDF extracts all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) from each item. The UDF returns the items separated by newlines. I also created a few stored routines called jsum, jmin, jmax, jcount, and javg, which can process the output of RAPID_EXTRACT_ALL and aggregate it. If you want to run RAPID_EXTRACT_ALL on only a portion of a document, extract that portion with the MySQL UDF JSON_EXTRACT first, then process the result with RAPID_EXTRACT_ALL.

For example:
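Using the hypothetical order document from earlier, extracting every invoice_price and totaling it per document might look like this:

    -- jsum adds up the newline-separated values returned by RAPID_EXTRACT_ALL
    SELECT id,
           jsum(RAPID_EXTRACT_ALL(doc, 'invoice_price')) AS order_total
      FROM t1;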

Aggregating all of the id values in the entire collection:
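A sketch, wrapping the per-document routines in ordinary SQL aggregates to roll the whole collection up:

    -- per-document jsum/jcount results, summed across all rows in t1
    SELECT SUM(jsum(RAPID_EXTRACT_ALL(doc, 'id')))   AS id_total,
           SUM(jcount(RAPID_EXTRACT_ALL(doc, 'id'))) AS id_count
      FROM t1;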

Of course you could extract other fields and sort and group on them.

Where to get the tools:
Hopefully this blog post helps you when indexing JSON in MySQL. You can find the UDFs in the swanhart-tools github repo. I think you will find these tools very useful for working with JSON documents in MySQL.

(This post was originally published on my personal blog, swanhart.livejournal.com, but is reposted here for wider distribution.)

Comments (4)

  • Fadi El-Eter (itoctopus)

    Hi Justin,

    Thanks for the post.

    Personally, if I want to search a JSON field, then I use Sphinx. The indexing will take a few minutes (depending, of course, on the size of the table), but the search will be super quick and the load on the database server will be zero.

    Quick opinion: I don’t think it’s a good idea to store JSON data in the database – but so many developers now do it, so I guess we’ll have to live with it.

    March 10, 2015 at 11:59 pm
  • Justin Swanhart

    Yes, Sphinx is a fine text index and can be used similarly, but I don’t see how it can be used with JSON out of the box, given JSON’s structure. You still have to transform it in some way, or does Sphinx have a JSON parser built in? The other part of this blog post, aggregation, can’t really be done with Sphinx.

    I think the biggest use case is taking an existing JSON data set, or a MongoDB data set, and being able to process it in MySQL, where you can do things like join the JSON to other data, etc. There are also cases where a schemaless approach really is best, and being able to index JSON, update JSON and aggregate JSON in MySQL is really important from that perspective.

    March 11, 2015 at 1:24 am
  • Fadi El-Eter (itoctopus)

    Hi Justin,

    We still have to transform the data prior to searching it…

    March 12, 2015 at 3:45 pm
  • Mario Uher

    Hey Justin,

    would you be so kind as to post your edited files? I tried it several times; however, MySQL refuses to accept my changes. I just want to eliminate the chance that my files are somehow broken.

    TIA,
    -Mario

    June 8, 2015 at 4:05 am
