Suppose you have turned on innodb_file_per_table (which means that each table has its own tablespace), and you have to drop tables in a background every hour or every day. If its once every day then you can probably schedule the table dropping process to run during off-peak hours. But I have seen cases where the tables had to be dropped more frequently (like every other hour), or when there was no such thing as off-peak hours, in such cases you need consistent performance. But dropping tables is known to cause stalls that exhibit themselves as reduced QPS during the table drops or stalls. Percona Server since version 5.1 has a feature know as “Lazy Drop Table” that alleviates the problem to a great extent but does not get rid of it completely. In the new releases of MySQL (versions >= 5.5.23) work has been done on reducing the stalls caused by DROP TABLE, these have been done in the form of fixes for bug 51325 and bug 64284.
So what is this blog post about? In this blog post I am going to give a high level picture of how Lazy Drop Table works and how the new DROP Table works in MySQL >= 5.5.23. After the explanation I will share the results of benchmarks that I ran to compare both the implementations. Note that whatever we are discussing here holds for when you are using innodb_file_per_table.
Before going into showing how Lazy Drop Table and the new DROP Table in MySQL >= 5.5.23 work, I would like to mention quickly, that how drop table in MySQL < 5.5.23 used to work. When dropping the table, a mutex on bufferpool would be held for the entire operation and then two scans of the LRU list would be done invalidating pages that belong to the table being dropped. Obviously this is going to take time which depends on the number of pages in the buffer pool at that time, and the longer this takes, the more stalls you have because, importantly, the bufferpool mutex is held for the entire duration. Now let's keep this old behavior of DROP TABLE as a baseline, and let's continue to look at the implementations.
Lazy Drop Table in Percona Server
The main function that is responsible for cleaning the bufferpool in the event of drop table is
space_id is the id of the tablespace corresponding to the table being dropped. Following are the steps performed in this function:
- Take a mutex on the LRU list of the buffer pool
- Scan through the LRU list and for each page in the LRU list:
- If the page belongs to the tablespace being dropped, then set a flag in the page structure marking the page as belonging to a tablespace being dropped
- Exit the mutex on the LRU list
- Take a reader-writer shared lock on the latch protecting the Adaptive Hash Index (AHI)
- Scan the buffer pool and for each page in the buffer pool
- If the page has a hash index built on it, (meaning the AHI has an entry for this page):
- Release the shared lock on the latch protecting the AHI
- Lock the page in exclusive mode
- Remove all entries in the AHI pointing to the page
- Release the lock on the page
- Take the reader-writer shared lock on the latch protecting the AHI again
- Release the shared lock on the latch protecting the AHI
Drop Table in Oracle MySQL >= 5.5.23
The main function that is responsible for cleaning the bufferpool is
buf_LRU_flush_or_remove_pages(space_id, BUF_REMOVE_FLUSH_NO_WRITE). Here
space_id is the id of the tablespace corresponding to the table being dropped and
BUF_REMOVE_FLUSH_NO_WRITE is a flag that is passed to this function meaning that only the entries in the flush_list should be removed. Following are the steps performed in this function:
- Loop while all dirty pages belonging to the tablespace (being dropped) have been removed from the buffer pool
- Take the buffer pool mutex
- Take a mutex on the flush list of the buffer pool
- Scan the flush list and for each dirty page in the flush list
- Remove the page from the buffer pool if the page belongs to the tablespace being dropped
- If we have removed 1024 dirty pages from the buffer pool thus far we release the buffer pool mutex and the flush list mutex for some time
- Exit the flush list mutex
- Exit the buffer pool mutex
- Try to force a context switch by advising the os to give up remainder of the thread’s time slice (this is going to let other threads do things on the buffer pool and prevents the buffer pool mutex from being kept for long)
- Take the buffer pool mutex and the flush list mutex again
- Release the flush list mutex
- Release the buffer pool mutex
Ok, now that I have shared a high level picture of both the implementations let’s take a look at the most important and major differences. You can see that the most important and the major differences in the two implementations is the use of buffer pool mutex in Oracle MySQL, this mutex is not used by “Lazy Drop Table” implementation. While the MySQL’s version of drop table relies on taking the buffer pool mutex and a mutex on the flush list, “Lazy Drop Table” implementation relies on taking a mutex on the LRU list (which is a list of pages ordered by their last access time) and a shared reader-writer lock on the Adaptive Hash Index latch. The buffer pool mutex is a major point of contention as it is taken at a lot of different places, while the affect of LRU list mutex is cheaper in comparison. However, a lock on the AHI latch can cause contention for writes, since its a shared lock, so while read queries can still access AHI, write queries would have to wait till the latch is unlocked.
Now about the improvements in Oracle MySQL’s implementation. The good thing is that there is no scan of the LRU list, and pages belonging to the tablespace being dropped are left alone in the LRU list and are evicted by the normal LRU process when these pages age towards the tail of the LRU. The other good thing is that in Oracle MySQL’s implementation the buffer pool mutex and flush list mutex are released after every 1024 pages being evicted. Both these things are an improvement over the previous implementation of DROP TABLE, but do they make enough difference. I feel that the buffer pool mutex and flush list mutex is still going to cause stalls. Well only benchmarks can give us the true picture.
I have tried to be realistic with the benchmarks. I have not tried to create-and-drop tables as fast as I can in the background, as that is not a realistic benchmark. In real life workload we will not be doing create/drop as fast as possible. Neither would I be looking at QPS of the entire benchmark as a single number, as what most people are often concerned about is uniform performance over a period of time. I would also like to point out one thing about benchmarks – we have been always advising people to look beyond average performance numbers because they almost never really matter in production, it is not a question if average performance is bad but what stalls and pileups you have.
Ok now some details about the benchmark.
First of all before running the benchmark, 30 tables were created and populated with 1000 rows, the tables all had a very simple structure:
CREATE TABLE t_x(i int(11) not null auto_increment primary key)
And during the benchmark, these 30 tables were then dropped at 30 second intervals in the background, while in the foreground sysbench was run for 15 minutes with 8 threads which would execute read/write queries against a 20M rows table:
sysbench --test=oltp --oltp-table-size=20000000 --mysql-table-engine=innodb --mysql-socket=/tmp/mysql_sandbox5524.sock --mysql-db=test --oltp-table-name=sbtest_20m --max-time=900 --num-threads=8 --max-requests=0 run
Note that the tables that were being dropped had been initialized before the sysbench run, so that the table creation time does not skew the results of the benchmark. During the entire run of the benchmark, all the queries were captured in the slow log by setting
long_query_time to 0. Then for analysis purposes the queries in the slow log were grouped by their timestamp, giving us a QPS no. for every second of the benchmark run. Finally, average QPS was taken per every five seconds.
The size of the dataset used is 4.6G, while the buffer pool size is 6G. The value of
innodb_max_dirty_pages_pct was set at 90, while the InnoDB log files were sized at 4G. The configuration and dataset sizes were chosen, so that the dataset completely fits in memory, and to prevent checkpointing from skewing the benchmark results. Prior, to running the benchmark, I also made sure that the buffer pool was completely warmed up with the dataset that sysbench will be using entirely in memory. Furthermore I filled up the bufferpool enough to make sure that the server represents one that is in active use. The buffer pool was filled up with 315K pages out of which ~100K were dirty.
Ok now some details about the hardware. The host has a CPU with two cores, 8G of memory and a software RAID5 array consisting of 4 disks.
The version of MySQL used for benchmarking pre 5.5.23 DROP TABLE implementation is MySQL 5.5.15, the version of MySQL used for benchmarking the new DROP TABLE implementation is MySQL 5.5.24, and the version of Percona Server used for benchmarking Lazy Drop Table is Percona Server 5.5.24 (innodb_lazy_drop_table=1)
Now so much for the details of the benchmark, lets see the graphs.
So we can compare the graphs above, and see that there are dips in QPS whenever table is dropped, MySQL 5.5.23 does show an improvement over MySQL version prior to 5.5.23, as the dips are not as regular and as low as in the old implementation of drop table, but the “Lazy Drop Table” implementation wins here. The dips are there in the case of “Lazy Drop Table” but the dips in QPS are not as low or as pronounced as compared to MySQL.
The new DROP TABLE related fixes in MySQL >= 5.5.23 have certainly reduced the performance hit, and the performance dips are not as pronounced as in older versions of MySQL. However, Percona Server with “Lazy Drop Table” still fares a lot better than MySQL. The dips are there, but not as low as in MySQL, though there is still room for improvement. I certainly like the idea of not scanning the LRU list at all during the DROP TABLE, and I think this is something that could be implemented in “Lazy Drop Table” to further improve the overall performance during background DROP TABLE operations.