As a principal architect at Percona, one of my duties is to tune MySQL database servers for our customers. The tuning effort looks at every aspect of the database service like the operating system, the MySQL configuration, the schema, the queries, etc. We have well-defined processes to tune the operating system and the MySQL configuration. However, tuning the schema and the queries using it can be anywhere from trivial to extremely challenging.
The challenge with the schema and the queries is mostly with the indexes. The most common types of indexes are based on b-trees or hash lists. InnoDB doesn’t support hash indexes, a bummer for equality conditions. B-tree indexes are more general-purpose, they are decent for equality conditions and very good for range conditions. They are however quite heavy and very dependent on the defined order of columns. They are also poor for double range conditions.
Double range conditions can be handled by the GIS R-tree type index, a topic I’d like to explore in a future post. The focus of this post is FullText indexes but not for their primary use case like text documents. I want to show you how this type of index can be used to efficiently solve performance problems of a type of schema I’ll refer to as “Tag-based schema”. It is surprising how little attention fulltext indexes are receiving, I did a quick search on this blog and our latest post was from 2018.
What is a Tag-based schema?
At least from a high-level view, a database table is used to map the properties of an object or an entity to columns. The object can be a user or a relation in a social network database or a transaction in a credit card processing database. Some objects have few, very well-defined properties but others have many dozens if not hundreds of properties. A table with a hundred columns that may need indexes is impractical and very difficult to tune. Let’s have a look at a few examples.
Example: House rental
Let’s imagine you are designing the database for a house rental application. You’ll need a table to describe the houses you are offering for rent. In such a context, houses are objects with many properties describing their basic attributes, amenities, and surroundings. Let’s try to list the main ones:
Attributes: Number of rooms, number of bathrooms, private bathrooms, full kitchen, double beds, twin beds, parking spaces, meeting/workspace space, rating (stars), price ranges, region, etc.
Amenities: Air conditioning, swimming pool, jacuzzi, fireplace, pool table, wifi, home theater, cable tv, nice view, quiet (or not), non-smoking (or not), etc.
Surroundings: Beach, hiking, walking, fitness, skiing, restaurants, Bars, Spectacles, Theaters, parks, distance from the city center, distance from the airport, etc.
Example: Tree classification
A very different example is a database listing species of trees and how to identify them based on their general characteristics, their bark, leaves, seeds, etc.
Bark: Smooth, paper, peeling, ridges, scale, etc.
Leaves: SimpleSym, simpleAsym, multi-opposite, multi-alternate, multi-whorled, Needle flat, Needle round, lobed, toothed, etc
Seeds: Fruits, Nuts, Samaras, Acorns, Cones, Pods, etc.
These types of objects are difficult to map. Other than the wide approach (many columns) we discussed above, the other traditional approach is the entity value schema. The entity value schema uses one row per attribute, instead of being wide, we can say it is long. The entity value schema is flexible but also presents a number of tuning challenges. Let’s explore an alternative using fulltext indexes and tags. In this alternative, you store the attributes word elements (tags), concatenating them using a delimiter into a TEXT column.
Tags with a fulltext index
First, we need to define a dictionary table for the attributes/tags:
1 2 3 4 5 6 |
CREATE TABLE `dictTags` ( `tag` varchar(10) NOT NULL, `description` varchar(100) DEFAULT NULL, `groupTags` int unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`tag`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
The table contains all the tags defined. A description is a human-readable version of the tag, GroupTags allows to set exclusive groups (only one value possible). The above table is a simple example but it can easily be expanded.
Here’s a sample tag list for a house or apartment rental site. Keep in mind I am not specialized at all in that type of business, I could easily be missing critical aspects. More realistically, there could be many hundreds of tags.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
mysql> select * from dictTags; +-------------+----------------------------------------+-----------+ | tag | description | groupTags | +-------------+----------------------------------------+-----------+ | 1bath | One bathroom | 1 | | 1bedr | One bedroom | 2 | | 1kmCenter | Less than 1 kilometer from city center | 6 | | 2bath | Two bathrooms | 1 | | 2bedr | Two bedrooms | 2 | | 2kmCenter | Less than 2 kilometer from city center | 6 | | 3bath | Three bathrooms | 1 | | 3bedr | Three bedrooms | 2 | | 3kmCenter | Less than 3 kilometer from city center | 6 | | 4bath | Four bathrooms | 1 | | 4bedr | Four bedrooms | 2 | | aircond | Air conditioning | 0 | | beach | Direct beach access | 0 | | City | Urban environment | 4 | | ctv | Cable TV | 0 | | Doublebed | Double beds | 5 | | firep | Fireplace | 0 | | hike | Hiking | 0 | | hometh | Home Theater | 0 | | Jacu | Jacuzzi | 0 | | Nature | Nature environment | 4 | | night | Nightlife options close by | 3 | | poolt | Pool table | 0 | | quiet | Quiet neighborhood | 3 | | sky | Skying options | 0 | | swim | Swimming pool | 0 | | Twinbed | Twin beds | 5 | | view | Nice view | 0 | | wifi | Wifi Internet | 0 | +-------------+----------------------------------------+-----------+ 29 rows in set (0.00 sec) |
We will store our properties to rent in this table:
1 2 3 4 5 6 7 8 |
CREATE TABLE `PropertiesToRent` ( `id` int unsigned NOT NULL AUTO_INCREMENT, `owner` int unsigned NOT NULL, `address` varchar(100) DEFAULT NULL, `tags` text, PRIMARY KEY (`id`), FULLTEXT KEY `idxFtTags` (`tags`) ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
In order to help manipulate the tags in a consistent way, we’ll create four handy SQL functions, hasTag, AddTag, AddTags, and removeTag. The SQL code for these functions is available on my GitHub. With these functions, let’s add a few properties.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
mysql>; insert into PropertiesToRent (id,owner,address,tags) values (1,1,'1 Québec street',addTags('4bedr,2bath,wifi,aircond,swim','')); Query OK, 1 row affected (0.00 sec) mysql>; insert into PropertiesToRent (id,owner,address,tags) values (2,1,'1 Ontario street',addTags('3bedr,2bath,wifi,Jacu,poolt','')); Query OK, 1 row affected (0.00 sec) mysql>; insert into PropertiesToRent (id,owner,address,tags) values (3,2,'1 Manitoba street',addTags('3bedr,3bath,sky,nature,ctv,Jacu','')); Query OK, 1 row affected (0.00 sec) mysql>; select * from PropertiesToRent; +----+-------+-------------------+---------------------------------+ | id | owner | address | tags | +----+-------+-------------------+---------------------------------+ | 1 | 1 | 1 Québec street | 4bedr,2bath,wifi,aircond,swim | | 2 | 1 | 1 Ontario street | 3bedr,2bath,wifi,Jacu,poolt | | 3 | 2 | 1 Manitoba street | 3bedr,3bath,sky,nature,ctv,Jacu | +----+-------+-------------------+---------------------------------+ 3 rows in set (0.00 sec) |
Now, we can use the fulltext index on tags to search the table. Let’s say you want to find out which properties have a Jacuzzi:
1 2 3 4 5 6 7 8 |
mysql> select * from PropertiesToRent where match (tags) against ('+Jacu' in Boolean mode); +----+-------+-------------------+---------------------------------+ | id | owner | address | tags | +----+-------+-------------------+---------------------------------+ | 2 | 1 | 1 Ontario street | 3bedr,2bath,wifi,Jacu,poolt | | 3 | 2 | 1 Manitoba street | 3bedr,3bath,sky,nature,ctv,Jacu | +----+-------+-------------------+---------------------------------+ 2 rows in set (0.00 sec) |
Do you also need to have a wifi connection?
1 2 3 4 5 6 7 |
mysql> select * from PropertiesToRent where match (tags) against ('+Jacu +wifi' in Boolean mode); +----+-------+------------------+-----------------------------+ | id | owner | address | tags | +----+-------+------------------+-----------------------------+ | 2 | 1 | 1 Ontario street | 3bedr,2bath,wifi,Jacu,poolt | +----+-------+------------------+-----------------------------+ 1 row in set (0.01 sec) |
Then you realize a swimming pool could replace the Jacuzzi:
1 2 3 4 5 6 7 8 |
mysql> select * from PropertiesToRent where match (tags) against ('+(Jacu swim) +wifi' in Boolean mode); +----+-------+------------------+-------------------------------+ | id | owner | address | tags | +----+-------+------------------+-------------------------------+ | 1 | 1 | 1 Québec street | 4bedr,2bath,wifi,aircond,swim | | 2 | 1 | 1 Ontario street | 3bedr,2bath,wifi,Jacu,poolt | +----+-------+------------------+-------------------------------+ 2 rows in set (0.00 sec) |
The weakness of this approach is that some attributes include other ones. For example, if you want a Property with at least three bedrooms, you’ll need to specify the additional attributes to include higher counts of bedrooms:
1 2 3 4 5 6 7 8 |
mysql> select * from PropertiesToRent where match (tags) against ('+(3bedr 4bedr) +(Jacu swim) +wifi' in Boolean mode); +----+-------+------------------+-------------------------------+ | id | owner | address | tags | +----+-------+------------------+-------------------------------+ | 1 | 1 | 1 Québec street | 4bedr,2bath,wifi,aircond,swim | | 2 | 1 | 1 Ontario street | 3bedr,2bath,wifi,Jacu,poolt | +----+-------+------------------+-------------------------------+ 2 rows in set (0.00 sec) |
Other implementations
My above example is fairly simple. As mentioned previously, there are two other implementations I can think of: wide and entity. Let’s look at what would be the equivalent queries of these alternate implementations.
Wide (many columns):
1 |
select * from PropertiesToRent where bedr >= 3 and (Jacuzzi = ‘y’ or swimingpool = ‘y’) and wifi = ‘y’; |
Entity (vertical):
1 2 3 4 |
select * from PropertiesToRent where id in (select propertyId from PropertyAttributes where AttributeName = ‘Bedroom’ and iValue > 3) and id in (select propertyId from PropertyAttributes where AttributeName in (‘Jaccuzzi’,’Swimminpool’) and bValue = ‘y’) and id in (select propertyId from PropertyAttributes where AttributeName in (’wifi’) and bValue = ‘y’); |
Both of these implementations have their own tuning challenges. The wide implementation is easy to deploy but hard to maintain and nearly impossible to index properly given the large number of columns that can be part of the where clause. You can easily expand the Entity implementation but then, the PropertyAttributes table will grow quite large. You must also have one column for every supported data type.
Conclusion
In this post, we used fulltext indexes to solve issues with schema mapping objects with a large number of attributes. This approach, using text tags to map the attributes is quite efficient and elegant. It is much easier to maintain than a wide approach mapping attributes to columns. It is also more efficient than the entity value schema where every attribute is a row from a table.
Fulltext indexes are less common than the b-tree indexes around which InnoDB is built. With the GIS R-tree indexes, there are among the less used index types of MySQL. Yet, they can be useful out of their intended initial scope, as we demonstrated in this post.
In a future post, I’d like to explore the use of R-tree indexes to solve another common SQL problem, the double range condition. If you are interested in unorthodox usages of database features, stay posted.