Proper indexing is a key ingredient of database performance, and MySQL is no exception. Percona CEO Peter Zaitsev discusses new indexing techniques based on MySQL 5.6 optimizer improvements.


Alisa Baum:

Hello, everyone, my name is Alisa Baum, and I am Percona's Director of Product Marketing. We will begin in a moment, but first I'd like to conduct just a little bit of housekeeping. First, please raise your hand using the hand icon located in the Go to Webinar control panel to let me know that you can hear me all right. Okay, great, I see a bunch of hands. Thank you very much.

Next, during this webinar you will be on mute. Should you have any questions during the discussion, please enter them in the questions field within the control panel. At the end of the webinar, we'll take time to answer as many questions as possible, and those that aren't addressed will be answered in a follow-up blog entry on Percona's MySQL Performance Blog.

In addition, a recording of this webinar will be made available to everyone within 48 hours. I'd like to thank you for attending today's webinar, MySQL Indexing: Best Practices for MySQL 5.6 presented by Percona CEO and founder Peter Zaitsev. In addition, Matt Griffin, who is Percona's Director of Product Management, will be joining Peter for the questions and answer part of the show, and with that said, I'd like to turn the floor over to Peter. Go ahead, Peter.

Peter Zaitsev:

Thank you Alisa. Today we are going to talk about MySQL indexing and best practices, and we'll specifically focus on changes in MySQL 5.6. This is a great topic to discuss, because indexing is a skill that is crucial both for MySQL developers and DBAs. Poor index choices are responsible for a very large number of production problems.

Broken down, indexing is not really a rocket science. If you digest what you will learn in this presentation, I believe you will be in the top 5 percent of MySQL DBAs or developers when it comes to understanding MySQL indexing.

This presentation will discuss three main focuses of indexing. First, we will develop an understanding of indexes and how they work internally. Next, we look at the approach that can be can used to set up indexing for applications, and then finally, we'll look at working around some common MySQL limitations when it comes to the ability to use indexes.

First, let's think about indexes: what are indexes in a nutshell? An index really is a database structure, which is created entirely to speed up access to a database to make queries run faster. If you take your database and go ahead and drop all of the indexes out there, and then run a select query; any select queries you are running will still be able to run, but they may not be able to complete in your lifetime.

Indexes also help us to enforce constraints such as unique, foreign keys, but these are really conceptually different data structures which aren't just typically implemented using indexes. There are a number of indexes, which exist in the wild. There are B-tree indexes which are the most common index type, and on which we will focus our attention.

There are R-Tree indexes in MySQL used for data which has been commonly MyISAM only, but which are also being implemented for InnoDB tables for MySQL 5.7, so I'm sure we'll be looking more at those index types.

There are HASH indexes, which exist in memory and NDB tables which are a very good choice for point look-up queries. There are also full-text search indexes, which are helpful to optimize our full-text search, natural language search queries which exist for MyISAM tables and InnoDB indexes starting with MySQL 5.6.

As you see, there are a whole bunch of indexes which exist out there. The good thing is, as a DBA or developer, I believe that more than 90 percent of your time will be spent working with the B-tree indexes, and this is what we are going to focus on for this presentation.

When we record the B-tree indexes, if you create the index with MySQL with the B-tree, we are really speaking about a family of different types of implementation. These all share similar attributes --they are similar to B-tree, however they are really two B-trees.

Depending on the storage engine, you may be looking at the B-tree indexes, B+Trees, T-tree and NDB storage engines or red-black trees for the in-memory storage engine. Important to know from a practical implementation, is that with regard to what kind of queries those indexes can and cannot optimize, they behave essentially the same. That is why we're all using the same B-tree name when creating an index.

In this presentation, I am using the B+Tree example because this is what InnoDB, the most commonly used storage engine, uses. As you can see, the B+Tree has an essentially tree-like structure. This includes a different structure for its branch nodes and/or root nodes, along with leaf nodes. The difference here comes with our branch nodes. A root node contains a pointer to other pages, while the leaf nodes have data pointers or in the case of InnoDB, the primary key can contain the data.

In this presentation’s example, we have multiple pointers going from our root page to the branch, and then leaf pages which allow us to find any data using their descent in the tree. What you also see in this tree is the pointer from a leaf node which goes one to the next. This is an important optimization which allows us to do range scans in the B-tree indexes without having to necessarily go back up the tree and go back down again.

What is interesting when referring to InnoDB tables in particular, is that we actually have a pointer within their leaf node pages following two directions, both forward and back. That means InnoDB tables are optimized for range scans both in forward and reverse order the same way. Finally, if you would note, of the four pages that are displayed here, you can note what often contains three values, while some contain only two.

This example illustrates the reality of B-Trees. They would often have free space in them, and that is somewhere between 25 to 50 percent of the total space in the B-Tree index pages. Some data can be added without re-balancing the tree. In reality, of course, the B-Tree uses pages which are significantly larger than displayed here, so for a typical index you may be looking for hundreds, or in some cases, maybe even as much as thousands of elements stored on this single InnoDB page. In most cases, the number of levels in a B-Tree is not very large.

Looking at the indexes for MyISAM and InnoDB tables, they are a little bit different. In MyISAM, data pointers point to physical objects in a data file, so all the indexes are essentially equivalent-- whether that's a primary key, unique key or secondary key; they're all same in the way that they behave.

In InnoDB, there is a primary key, either implicit (which was created for you) or explicit (which you create manually). This is special because it stores data in the leaf pages, not a pointer to the data. The secondary index stores the primary key as a data pointer.

Moving on from understanding indexes, let's think about what kind of operation our B-Tree index can optimize from the point of view of a lookup, such as a simple WHERE clause. First is a point lookup, or equality constraint, when you look for everything that contains keys equal to a certain value.

You can look at the ranges which are open from one side, something with a key, or more than five, or closed ranges where a key belongs to the range between 2 values. These are essentially the most important types that the B-Tree index can optimize. There are some areas where a B-Tree index cannot really help us.

If you'd want to look at all rows where the last digit of the key is zero or the last letter of the name is "A", B-Tree indexes cannot help us to find this data faster, because it can't be defined as any kind of range operation.

When we speak about strings, they work similarly to numbers because we can take a look at the strings and define the sort order within them, which is called a collation in MySQL. The different languages, the sort rules being different--that's called the collation being different. This defines the sort order within the strings, which are needed to create an index.

With strings, we can have a ‘like’ operator, and a like operator that takes a prefix actually is changed internally, becoming a range. The range can be optimized quite well. Similar to a fixed suffix, the percentage sign or underscore and the star - these can't be optimized well. This is one of the nastiest queries into which you can run. It would be better to do a percent search string, instead of a keyword and then percent search, because those typically cause full tables to be very inefficient.

Let’s move on to multiple column indexes. With multiple column indexes, we have two or more columns, almost concatenated, and the sort order within those define how the first column is compared, then the second, the third, and so on and so forth.

It is important to understand that for multiple column indexes, there is still one B-Tree that is created, not any kind of a fancy structure where there is a separate B-Tree for each level. There is only one B-Tree where the comparison rule is applied, or from the first column to the next in the index.

One other issue to note is that indexes are not free. They're actually quite costly, so you don't want to add more indexes than you need for your application to work well. Typically, you are thinking just about the number of indexes, and then about how many columns the index has. If you have a choice of adding another index or extending an existing one with another column, extending is the better choice in the majority of cases.

Indexes have overhead both in the reads and writes. It's really well-understood for writes because updating the data often means updating indexes as well. The more indexes you have, the more expensive it becomes. It also makes reads more expensive. You cannot optimize those indexes, because wasted indexes mean wasted space on disk and memory, as well as additional overhead for the MySQL optimizer. This would already have gone through all existing indexes in many cases, and would have considered them for execution. If you have those indexes there will be more overhead.

When speaking about the cost of indexes, there are multiple things impacting this. For example, a long primary key is rather expensive because it makes all secondary keys longer, since they contain the primary key to refer data. A random primary key for InnoDB is bad because it is a clustered key, and random insertions will cause a lot of page splits. This will result in a very fragmented and poorly performing table.

I will often refer to something as being “random”. I mean that if you have a column with SHA1 or MD5 values as a primary key, this is a pretty bad choice because it spreads the data across all the pages for keys.

Now, having said this, longer indexes are generally slower. If you have an index on 255 characters, in a very long string, it will be longer and slower than the index on a couple of integers. Even when it's not just a primary key, indexes with insertion and random order are more expensive. Indexes with sequential inserts, which are correlated with your primary key values can be cheap.

If you have an auto-increment ID and insert time that grows sequentially with auto-increment value, it wouldn't be very expensive to keep. Also, to specially log selective activity, such as an index on gender or state, these are significantly less expensive than passive activity.

In using InnoDB tables, it also makes sense for us to get the most out of our primary key. We want to pick a primary key which suits us best, keeping in mind that lookups on a primary key will be fastest for InnoDB tables, especially for ranges.

One of the very common tricks we've been using is when you have something like messages or comments on posts. It is very powerful to create a primary key which contains something like Post_ID,Comment_ID. This allows us to keep all the comments essentially clustered together in one or two InnoDB pages, which drives can read very quickly.

In some frameworks you may do much better with just a single integer key. In those cases you can build a single key by using BIGINT column and storing 2 values in higher and lower 32bits as a workaround. Another thing to consider is that in most recent MySQL versions, the primary key is implicitly appended to all the indexes in Innodb tables.

You now have a primary key ID and then you have Key A. The key internally will be as good as A,ID and you can use that for sorting or lookups. So for example, if A equals 5 and ID equals more than 100, this can be resolved very well using just an index on A. You don't have to have an index on A,ID.

Now let's move on and talk about how MySQL uses those indexes. There are a lot of ways, including data lookups, (perhaps the most important one), sorting, avoiding reading the data, and some special optimizations as well.

How do you use indexes for data lookups? Well, let's look at this simple example: select from employees looking up their last name in a database. Having an index on “last name” would help us to run this query in the most useful way. We may also want to lookup by last name and department. In this example, a combined index allows a query to run most efficiently, when indexed on a (department,last name) or (last name,department) for this given query.

How can indexes can be used if you have an index with multiple columns? Let's say you have index A, B, C exactly in this column order. Column order matters a lot when you have indexes. This will have three parts. First is the index, which will be used for all listed key parts. You can see that the index can be used for all columns, even in the last mentioned case where A is an equality, B is in a list of values, and C is a range.

It is worth pointing out a second common case where MySQL will not be able to use an index. An example of this would be: B having a value more than 5, because the leading column is not referenced, and MySQL can only use the indexes starting from the first column, then add other columns to the index used.

There is also a third case that is more complicated. You will have an indexed used, but it's not really used to its fullest extent. For example, you have index equaling A more than 5 and B equaling 2. You may say, "Oh, the index is used and that's great," but in reality, only the key part of A more than 5 will be used for a lookup, and then everything else will be forced to filter data.

In the second case, you have similar situation. The first two key parts will be used, not all three, as we might expect. I believe this is a very important thing to know because it's very easy to get a mistake here thinking, "Oh, an index is used," and not really think about what key parts are being used.

As a rule, MySQL will stop using key parts in a multiple index as soon as it meets any real range. However, if you have an IN clause, then MySQL is still able to continue using further key parts.

The second way MySQL can use an index is for sorting. This can work in both cases. Like in this example, when selecting from players, we can use an index to sort the data. This is very important because if you don't have an index to sort your data, you will see the file sort which has to read through the whole table. It doesn't scale at all with the large data sets.

We can also combine using index for lookup for data and order by. In the second case, we are putting up the top scores for a given country. For the US, this can be sorted by index on Country,Score where the key part country will be used for the lookup and the score to efficiently sort everything for this country.

With sorting, it becomes more tricky to see how indexes can be used. First in a multiple column index you can sort by one column or two columns in the same order. You can also use equality sort in the first column and order by second, again in both directions; or you can have a range on the first column and sort by the same column. These are all situations where all the possible parts will be used.

There is also a case where an index cannot be used for sorting. For example, you cannot sort for a second key part of the index if you're not sorting by first part. You may have key A,B, but to order by B won't work. You also won't be able to use your index if you have a range on the first key part and then you sort by a second key part.

The index will be used for a range so it's all right in this case, but you'll have to do file sort for all the values which are A more than 5 in the given example which is very expensive. It also doesn't work with IN, very contrary to our previous example with our normal lookups. And finally, you have to know that you cannot sort by two columns in a different order. If there are multiple column indexes, you have to sort in the same order.

So again, following that out as a rule, you can’t sort in different order by two columns, and you can also use only an equality comparison for all the columns in the index which are not part of the ORDER BY. Otherwise, MySQL won't be able to use the index on order by. The next important optimizations that MySQL has is its use of the index to avoid reading the data. The data will be in the data file when you happen to use MyISAM tables or primary key in case of InnoDB tables.

We want to avoid doing that because that is an extra read, along with the data being much larger than the index. It doesn't present in the cache to the same extent. If you want to look up, for example, the status information for all orders for a given customer ID, we could possibly have a key on the customer ID alone, and the lookup will be quite efficient. It will find all customer IDs and go to the primary keys, continue the information about the order from the customer/status call.

This can be made substantially faster if it had a key on Customer_ ID,Status. With this example, MySQL will get all the data from the index alone, not going through the data. This is much faster and it's especially valuable when we have access, which traverses many indexes, because the access becomes more sequential and less random in different location in the data files.

MySQL can also use index to optimize MIN and MAX aggregate functions. But note, only those two aggregate functions. You cannot optimize average or any other function. True aggregate functions actually need access to all the data. You can have an index for things like select MAX(ID) from table or even, let's say, MAX(salary) for each department. That can be optimized for an index in both example.

In the second example, you can check that, even for very special extra information, which is called using index for group-by.

We have spoken about using index in the single table queries. Well, beyond those, we often use joins in the SQL language. This is what SQL is good at, although it has to join table with data from different tables.

How do indexes work? When we look at the joins, MySQL uses the nested loops methods for the majority of the queries. It uses a modified and optimized version of an algorithm, but logically, that is what it uses, so it picks one of the tables. It has chosen comments in this example. It scans to find all the comments, and then for each of them it performs the lookup on the second table and so on and so forth.

The best way to handle this query is to have an index on the posts table so we can find all the posts with author equals people, and then have an index on posts.ID and comments so we can look up all comments for a given post.

There are a couple of things to consider here. First, for joins, it's very important to have your joins indexed well. If your joins are not indexed very well, then it's likely to be a bomb just waiting to blow up as the data size increases. It can get really bad, really quickly.

If you can't have a join query, which is using an index as well, you may have to re-design and re-architect things, but don't leave them if you expect an application of any significant size. The other thing you have to consider is that you really only need to do the indexes in the direction of the lookup. We are going to look up from the author table, to the comments table by the post_ID.

The author table index is necessary. The index on is not needed in this case to optimize this query, even though it is still likely to be there as a primary key. There are many situations where I notice people see all the comments that needed in the joins indexed in all tables, and that produces a lot of unnecessary indexes. It also creates extra-hard work for MySQL optimizers to research all the combinations how tables can join.

Starting with MySQL 5.0, MySQL can use multiple indexes for the same table. That is what we call an index merge. It is important to know there are two very similar looking but very different variants. One is when you have two equality conditions with A&B query, and in this case MySQL may be able to use two indexes on A and B separately to optimize a query. The index on A,B which combines both the columns can be used to resolve both the conditions in the where clause from the same index, which will be much, much better.

If you have an "or" within those conditions, then you have to have two indexes on A and B and use a merge within those. It's as good as it gets for such query. And it's worth to know, but in this case, if you have a composite index A,B, in a query like this, this index won't be used. It won't be used at all, so keep that in mind.

Another interesting feature that MySQL provides is called a prefix index, which allows us to build an index on the leftmost prefix of the column. This can be very helpful if we are doing a lot of prefix searches. Even equality searching at the leftmost prefix is selective enough. It also can be very helpful for blob and text columns because if you want to index those, you must specify a prefix.

Prefix indexes cannot be used as a covering index, so keep that in mind. If if you change to a prefix index, you will lose the ability to use that as a covering index and it should be worth it.

In choosing the right prefix length, common wisdom dictates that the prefix should be selective. Here we have a table with about 2 million rows, and you can see we have a total about 1 million distinct titles in this table, and 960K different titles, different prefixes of 20 chapters, and about 600 something of which have a first prefix of 10 characters.

You could say that running 20 characters is very good. It's almost as selective as the whole table, while it will take half of that space or even less. However, it is not enough to check for that. You should check for outliers. Checking this, we can see there are not a lot of outliers. The most common prefixes and the most common titles are very close by, but that's not always the case.

For example, if we're speaking about indexing cities in the world, we can consider that having a prefix with the first four characters would be sufficient to provide us a good uniqueness, and in average, that is true, but only if you don't think about other very common four characters sequences such as San and New . There are a lot of cities that can fit into those categories. This is why such a prefix wouldn't have been very helpful. Searching San Francisco or San Fernando would have to scan through a lot of index entries to find the proper match.

There have been changes in MySQL 5.6. What exactly has changed and what should we consider for index design? There are a lot of optimizer improvements in MySQL 5.6, but most of them don't really need you to do anything in terms of your index design structure. It applies the new query execution methods which commonly can be based on the same index structure as you had for the previous MySQL version.

It is important to note a very new meaning in MySQL 5.6; the join_buffer_size variable. Previously, that was kind of a throwaway variable, which only made sense in doing the joins if there were no indexes. In in MySQL 5.6, it's also used for some of the newly advanced optimization such as batch key access, and so selecting a high value, something like 32MB, often can make sense.

The most important thing I think we should talk about for index design consideration is a new method of data access called Index Condition Pushdown. Index Condition Pushdown, (also called ICP) is when a MySQL optimizer can push certain conditions back to the storage engine to optimize. This is done instead of fetching the rows from the storage engine and then doing the filtering on the extra fields on the MySQL server level.

ICP gives us a more flexible covering index than ever before. MySQL 5.5, had a sort of an unknown approach to using the index for reading the data. A current index required that all reference columns be there, or the index would be used only for look up, and then rows are attached and other conditions examined.

This all changed in MySQL 5.6. An example is if you have a query, you we select A from the table, and have a clause B equals 2 C like certain search pattern. In MySQL 5.5 and below we have two approaches to the indexes here. One is traditional. We can believe that we'll just put the index on B to resolve with a cause to close clause, and then everything else will be checked after rows B and C from the search engine. We could also create in this B, C and A, which are the current indexes that include all the columns included in the query. This can be used in MySQL 5.6 as well, but you can also create an index on B and C which can be very valuable, for example, if Column A is a very long column.

What can happen is that B will be handled as a normal index lookup, then filter in C where a full-text sort will happen on the storage engine level before passing data back to MySQL. Then only rows that match both those conditions, which are few, will be passed back so the A column can receive the text from a primary key.

Another optimization can happen to the limitation of the index, if you have the indexes A, B and C and an index on A where A equals 5 and C equals to C only. In MySQL 5.6, you will also use Index Condition Pushdown, so only the index engines will have to be examined for A equals 5, and filtered wherever C equals to or not. It does not equal to reading all of the rows, which can be significantly cheaper.

The next question that needs addressing is how does MySQL picks the index to use. MySQL picks indexes dynamically for every query execution. Even if you're using prepared statements that depend on the data state and the exact constants you pass, the query plan may differ. MySQL optimizer has two ways to estimate the number of rows matched and pick the good plans.

Indexes that know the actual constants for index lookups will go and perform the so-called index dive in the table to estimate the number of rows possible. If that's not possible, which typically happens for tables that are not a first table in the join order, it still uses cardinality information instead, and this is something that analyzes table updates.

It’s worth noting that the big variable that MySQL tries to optimize is the number of rows it has to examine. That is not really what is used inside the code. There are a lot of various heuristics which have been accumulated for years. MySQL, for example, considers the primary key as special. It understands the benefits of using a covering index for the full table scan because it is faster.

It is really hard to remember what exactly impacts the MySQL decisions, especially because everything is different between MySQL versions. My advice would be to check the plans that MySQL actually uses in production on your real data. Don't think you actually know how MySQL will execute different kind of queries.

This is accomplished with Explain. Explain is a very great tool, but explain itself will require a presentation all on its own.

The next discussion point to address is the optimal indexing strategy for your application. Important to note is what you build, and whether this is considered all or at least the most performance critical queries in the application, not just one in form. We want all ‘where’ clauses and ‘join’ clause indexes wherever possible. If they are not possible, at least a more selective part can eliminate the largest amount of rows from consideration.

Generally, try extending an index if you can, instead of creating your own. We have discussed that it's cheaper. You should make sure to validate query plans and performance impact, because you may have some unanticipated impact, for example, things become very slow because an index became too large.

Let's say you have two queries that you have to optimize which are shown on the slide. Now, if I looked at those queries one by one, looking at Query No. 1, I would build an index in A,B, because these queries happen to be written in this order. If you look at two of those queries at the same time, I would put B and A in the same index because it executes both queries better.

So I wouldn't create an index of A,B and then a separate index on B, A. If everything is equal in terms of what kind of key parts you can cover, it's better to put the most selective key parts first, but if it's less important, then I actually set this up so that MySQL can use them in the best possible way.

Also note that you should not add indexes for non-performance critical queries because may indexes may slow the system down, and sometimes when you have a report that you are doing once a month, which may take a couple of hours, it's not worth optimizing that at the cost of your every-day real-time queries.

Finally, let me cover a couple of tricks which you can use to work around the MySQL optimizer limitation. The first one I call “enumerating ranges.” Let's say we have an index on A,B and we have something like select where A between 2 and 4 and B equals 5. As we discussed before, that you can conventionally use only one part of the index because after ‘between’ no further key parts can be used.

If you replace in that workload ‘between’ with all the exact values which are in that range, like 2, 3 and 4 in this case, then MySQL will use both key parts. MySQL is not smart enough to do that by itself.

You can actually extend this technique to add a fake filter. If you have a prefix column which is not very selective, like in this case, gender, and the majority of your queries both specify, for example, a gender and a city, in this case, but you also would like to run a query which don't specify the gender, then you can either use an IN clause, which specifies all different values which could be in a gender column, like in this case, for example, we talked about only the most popular genders out there. And this trick actually works with other things. It works with gender, status, Boolean types.

It can also work with things like states, for example. There are not too many different variants to put in. The third big limitation is dealing with sort. As we spoke about, the sort optimization ability is quite limited, so if I have to have some where clause which is more than equality on that first column, then I will get a file sort.

In many cases we need something like IN. What you can do is re-write this instead of IN as a union all query. If you take a look at this query plan, you will see a much more nasty looking plan than for a normal state. That query can often go many times faster than the original one. That's it for my presentation today, and Matt, I am ready for questions.

Matt Griffin:

Excellent. Thanks, Peter. We've got some great questions. Just as a reminder, if you have any questions, please post them here and we'll get to them with Peter. So let's just start off with a question here with regard to covering indexes.

The manner in where, group by, and order by clauses but not fields in select, does the order of the columns in the covering index matter or do you have to use the first one somewhere in the query to be able to leverage the others in that covering your composite index?

Peter Zaitsev:

Which order they want to use the columns for a covering index? First, you want to put the columns that you're actually going to use for your lookups and the sorting, if there are such columns. After those columns, you can append other columns in any order. It doesn't matter in this case when we speak about a non-specific query.

But remember, often you may have the same index which is used by some other queries, so I typically would see what kind of order of the columns may help benefit in some other queries which I either have right now or may have in the future which could use those other columns for lookups or sort or something like that.

Matt Griffin:

This, I think, maybe relates to your second trick. So in the beginning of the presentation you spoke about different types of indexes. This question is about the bitmap type index, and let's say that there are five or ten different values and a million rows of data, and Oracle has the bitmap index, but MySQL, is B-tree only available? Would it be too costly? And then my follow up is would this be a good application of your second trick?

Peter Zaitsev:

MySQL at this point doesn't have the bitmap indexes, so there are mounds of workarounds which you have to use to alleviate those depending on exact conditions, and that's maybe one of them.

Matt Griffin:

So when you were discussing the costs of indexes a question came up. What do you mean when you say extending index is better than adding a new one; can you give an example?

Peter Zaitsev:

Let's say you have a query, you have an index on just Column A and then you have a query which has a lookup on A,B. A equals to 5 and B to 5. You really have two choices, right, to say, "Oh, this query needs index on A,B why don't they go ahead and create that?" And in this case, in the end, you'll have index on A and index on A,B while most likely extending the Index A to making that index to be A,B would be better instead.

And this is a very common problem I see with MySQL DBAs and developers is don't look at what kind of existing indexes exist already and just when they see a query which is not running well, they just add a new index for a query.

Matt Griffin:

As a follow up to that, other than explain, can you recommend any other tools for index uses optimization?

Peter Zaitsev:

Well, I would say the explain is a main tool. Now, in addition to that, I think it's very helpful to use other tools to observe query performance. So you can use tools like pt-query-digest or Percona cloud tools where you can essentially see every query execution time before and after you implemented the change which can be very valuable because it allows you to see in the production not only the impact on the query you are trying to optimize, but to other queries as well because you may have sometimes unexpected side effects.

You optimize this given query, but you have a certain insert that can be a lot slower or you have triggered some part, and some other queries start to have bad optimizer findings, and so on and so forth, so there's a lot of possibilities here.

Matt Griffin:

What is the rule of thumb for setting the value for join buffer size for systems with, say, over 16 gigs of RAM?

Peter Zaitsev:

The thing here is, it's not a question about the amount of RAM but also about how many queries you run concurrently, and how much data it goes through. So batch key access, what it gives us access optimized by certain keys and accessing them in a more sequential order. So even a query which accesses, for example, 1 million of rows, then that means it will have a chance to sort those keys to access in that order, and this means it may be able to use up to 16 or 32 megabytes of memory and not more.

So the two factors you have consider first is how many keys, really how many rows that your query can access, and so how much the join buffer size could possibly use, and the second is how many queries I plan to run concurrently because if you have a lot of those queries going, even parallel, then you have to account and provision enough memory to allocate that different buffer size for all of them as well as have enough memory for entire buffer.

Matt Griffin:

What are the positives and negatives or just overall considerations when indexing a varchar or a text column, in general when is it a good idea to do and when is it a bad idea to do when your balancing performance and disk space, and are those the primary concerns?

Peter Zaitsev:

The thing here is if you need a query which looks at a large amount of data to run faster and the where clause happens to be a varchar, it has to have indexes, and their storage impact often becomes a secondary concern which can be dealt with in different ways. We can, for example, choose not to create the full index, but only create the partial index.

We can also choose things like enable InnoDB compression if you need data space or even use a different storage engine such as TokuDB which can allow us to fit much more data in the same storage space.

Matt Griffin:

If anyone has any more questions, please post them here. We will get to the rest of the questions, we had a lot of questions, with a follow-up blog post, and with that, I'll hand it off to Alisa. Thanks, Peter.

Alisa Baum:

Thank you, Matt and Peter. Great job today. Peter there are tons of questions and just like Matt said, I will make sure Peter has every last one of the questions in his hands to address in a follow-up blog post. Thank you everyone for joining us today and next week we hope that you join us for the webinar Nine Percona Toolkit Tools that You Can't Live Without. Would love to see you there. And with that said, thank you very much for your time today, and we hope to see you in the future. Take care.

Related Videos: Percona MySQL Webinars, Best Practices, Operations, Query Tuning and Optimization, Search and Indexing, MySQL 5.6

DBdeployer, the Community Edition - Giuseppe Maxia - Percona Live ONLINE 2020
The State of ProxySQL, 2020 Edition
Databases on Kubernetes - Rick Vasquez - PLO October 2020
Disaster Recovery and High Availability: The Concepts, The Mistakes, and How To Properly Plan For Failure
BPFTrace - Dtrace Replacement on Linux
MySQL Schema Review - Michael Benshoof - Percona Tech Days
Percona Server 8.0 features you always wanted but were afraid to ask for - MySQL Tutorial