Billion Goods in Few Categories: how Histograms Save a Life?
We store data with the intention to use it: search, retrieve, group, sort... To perform these actions effectively MySQL storage engines index data and communicate statistics with the Optimizer when it compiles a query execution plan. This approach works perfectly well unless your data distribution is not even.
Last year I worked on several tickets where data follow the same pattern: millions of popular products fit into a couple of categories and rest used the rest. We had a hard time to find a solution for retrieving goods fast. Workarounds for version 5.7 were offered. However new MySQL 8.0 feature: histograms, - would work better, cleaner and faster. This is how the idea of the talk was born.
I will discuss
- how index statistics physically stored
- which data exchanged with the Optimizer
- why it is not enough to make correct index choice
In the end, I will explain which issues resolve histograms and why using index statistics is insufficient for fast retrieving of not evenly distributed data.