Thanks to everyone who attended, and especially to folks who asked the great questions. I answered as many as we had time for during the session, but here are all the questions with my complete answers:
Q: Can you compare the use of subqueries/multiple joins vs. multiple queries (e.g. temp tables)?
For performance, it’s hard to make a general rule, because there are different types of subqueries (correlated, non-correlated, derived tables, scalar subqueries) and we saw in my presentation that sometimes these perform worse and sometimes better than another equivalent solution.
If you’re talking about using a temporary table to store an interim result set, and then use that to ultimately produce the result you want, that’s another technique that I see used, although it’s less common. It can be helpful to simplify the development of a report, but whether it’s a benefit to performance is a case-by-case decision.
There are other drawbacks to using temporary tables in this way, for example when using replication, a temporary table could vanish if the slave restarts and then subsequent queries against the temp table fail.
Q: Why doesn’t the query optimizer take care the execution decisions? Since they are all logically the same, I believe it should all translate to the same execution plan. (I know it doesn’t though.)
Why don’t you try writing a SQL query optimizer and see how easy it is? 😉 Generative grammars like SQL can produce an infinite variety of queries. The query optimizer analyzes queries for common patterns the MySQL designers know can be improved. But there’s a finite amount of engineer-years spent developing and testing the query optimizer. The developers have to make a decision about which query patterns the optimizer recognizes, and which are left to the developer to refactor manually. There could also be some cases where optimizing a query automatically would be more costly than just running the query in a suboptimal form.
Q: Doesn’t the primary key solution for random selection only work when the IDs for movies are distributed uniformly over the range 1..MAX(id)?
Yes, there’s a risk if you have irregular distribution of matching rows that you’ll get skewed results. For instance, if there’s a gap of id’s between 47000 and 50000 with no movies, but your random number generator picks values in that range with equal frequency as any other range, then the movie immediately following the “gap” will be picked more frequently.
Nearly every solution for randomly choosing rows involves some compromise, either of performance, or of accuracy of randomness. The ORDER BY RAND() solution is known to have poor performance, but it returns a better random choice.
Another workaround may be to add a column to the table, and populate the rows you want to choose (movies, in this case) with values known to be consecutive. Then you could choose a random value, look up the row with “=” instead of “>” and be guaranteed to find exactly one match. But the tradeoff of this solution is that it requires storing another column, and reinitializing the sequence after making certain insert/update/delete operations. If you have a dataset that changes infrequently, then this might be a good tradeoff.
Q: Is the tuple comparison independent of sorting order? Wouldn’t it be a problem if the keywords were returned in a different order from what you specify in the query?
The tuple comparison you’re referring to is this example:
WHERE (k1.keyword, k2.keyword, k3.keyword) = ('espionage', 'nuclear-bomb', 'ejector-seat');
This is okay because each correlation name (k1, k2, k3) is an index lookup. Look at the EXPLAIN output — it accesses these first, by looking up the keywords you specify. It doesn’t matter what order you specify the keywords in this tuple, it is equivalent to this expression:
WHERE k1.keyword = 'espionage' AND k2.keyword = 'nuclear-bomb' AND k3.keyword = 'ejector-seat';
The AND operator is commutative, so the order of these terms doesn’t matter.
However, if I were using the tuple syntax to look up values against multiple columns in a compound index, then the order would matter, and it would have to match the order of columns in the index. Example:
WHERE (last_name, first_name) = ('Karwin', 'Bill');
Q: On the Dynamic Pivot, the straight join, can you explain more about why you thought about doing that and why it helped?
When I did not use the STRAIGHT_JOIN, the query optimizer reordered the tables. It seemed to prefer an index-scan of 7 rows in the
kind_type table to be first, then look up matching rows in
title by a secondary index. But the result was that it created a temporary table to count the movies per production year for each kind_id.
It was more efficient in this case to force MySQL to scan the
title table first, grouping by kind_id in index order. This made the first table in the EXPLAIN seem like it was scanning more rows. But by avoiding the temporary table, and making lookups to the
kind_types table by primary key, the result was a query that took half the time.
An important conclusion of my presentation today is that sometimes you find unexpected differences in performance like this, so it pays to test all different solutions, and measure the resulting performance!
Q: Bill, back to the query where you were trying to determine the last episode of each TV show. Couldn’t you have used a SUBQUERY to fetch and compare MAX(episode_id) as an option there?
Yes, I think what you’re referring to is what I covered as the Derived-Table Solution on slides 55-59. Apologies if that wasn’t clear, because I didn’t present the details of all the tables and columns. It turned out this solution was 150x faster for this case, so your suggestion is a good one!
Thanks again for attending my webinar! Here are some more tips: