I had an interesting case recently. The customer dealing with large MySQL data warehouse had the table which was had data merged into it with INSERT ON DUPLICATE KEY UPDATE statements. The performance was extremely slow. I turned out it is caused by hundreds of daily partitions created for this table. What is the most interesting (and surprising) not every statement is affected equally as you can see from the benchmarks above:
I got the following test table created:
CREATE TABLE `p10` (
`id` int(10) unsigned NOT NULL,
`c` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
PARTITION BY RANGE(id) (
PARTITION p100000 VALUES LESS THAN(100001),
PARTITION p200000 VALUES LESS THAN(200001),
PARTITION p300000 VALUES LESS THAN(300001),
PARTITION p400000 VALUES LESS THAN(400001),
PARTITION p500000 VALUES LESS THAN(500001),
PARTITION p600000 VALUES LESS THAN(600001),
PARTITION p700000 VALUES LESS THAN(700001),
PARTITION p800000 VALUES LESS THAN(800001),
PARTITION p900000 VALUES LESS THAN(900001),
PARTITION p1000000 VALUES LESS THAN(1000001)
I used MySQL 5.1.41 for my tests increasing buffer pool to 1G and log file size to 128M so all data is comfortably in memory.
I ranged number of partitions from 1 to 1000 and loaded the table with 1000000 of sequential values from 1 to 1million (the C column was set same as ID column) using bulk insert statements – 1000 rows in each.
Loading the data was taking 9 seconds for 1 partitions, 10 seconds for 10 partitions, 16 seconds for 100 partitions and 23 seconds for 1000 partitions, which means it slows down 2.5 times as number of partitions increases to 1000.
This regression is somethat surprising as in reality only 1 (max 2) partitions got data inserted to them per insert statement.
As I tested the UPDATE path on INSERT OF DUPLICATE KEY UPDATE (adding ON DUPLICATE KEY UPDATE set c=c+1 to my bulk inserts) the regression became even larger.
1 partition insert took 50 seconds to complete, 10 partitions 52, 100 partitions 72, 1000 partitions 290 seconds. The performance loss with large number of partitions is about 6 times in this case and the pattern is a bit different – the slowdown becomes drastic as we go from 100 to 1000 partitions – about 4 times compared to only 1.5 times slow down for the same case with just insert.
The difference grows even larger if we remove index on column C – the “UPDATE” part of INSERT ON DUPLICATE KEY UPDATE completes in 22 seconds for 1 partition and 250 for 1000 partitions – which is over 10x difference.
Both MyISAM and Innodb are affected by this issue. The Update path for MyISAM without indexes took 10 seconds for 1 partition vs 52 seconds for 1000 partitions (the table_cache was warm for this test)
I had couple of suspects in this case – is this the “setup” overhead of statement opening all partitions for execution or is this “per row” overhead. To check this I ran the statements with different number of rows in the batch. With 100 rows per batch and with 10000 rows per batch performance was not significantly different from 1000 rows per batch so I conclude this is per row overhead.
This test gives other rather interesting data point – it looks like updates in Innodb for CPU bound workloads can be as much as 5 times slower than inserts – which is quite a difference.
CONCLUSION: This is interesting topic and I should do more tests on this. In the meanwhile the data point is simple – be careful with number of partitions you use. Creating unused partitions for future use may cost you. From this benchmarks the performance remains reasonable with about 100 partitions (which is reasonable number for most applications) though I would expect numbers to be significantly workload dependent.