Using CHAR keys for joins, how much is the overhead ?Peter Zaitsev
I prefer to use Integers for joins whenever possible and today I worked with client which used character keys, in my opinion without a big need. I told them this is suboptimal but was challenged with rightful question about the difference. I did not know so I decided to benchmark.
The results below are for MySQL 5.1.18 using MyISAM and Innodb tables. This time unlike other benchmarks I decided to do Join not on primary key and have query to read data for both tables. If the query would be index covering I would expect us to see different ratio. The query I use here is constructed to stress out join code while avoid sending data to the client Do not try to find any good meaning for query or schema. For joins which fetch just few rows difference is likely to be less as the join code itself is likely to be responsible for less portion of response time.
OK. Lets start with first simple MyISAM table and join query performed on INT fields:
CREATE TABLE `intjoin` (
`i` int(10) unsigned NOT NULL,
`c` char(10) DEFAULT NULL,
`j` int(10) unsigned NOT NULL,
KEY `i` (`i`),
KEY `j` (`j`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
mysql> select sum(t1.i+t2.j+length(t2.c)+t1.j) from intjoin t1 left join intjoin t2 on t1.i=t2.j;
| sum(t1.i+t2.j+length(t2.c)+t1.j) |
| 10239901336 |
1 row in set (1.70 sec)
Here is explain if someone curious, it did not really change beside key lengths:
mysql> explain select sum(t1.i+t2.j+length(t2.c)+t1.j) from intjoin t1 left join intjoin t2 on t1.i=t2.j \G
*************************** 1. row ***************************
*************************** 2. row ***************************
2 rows in set (0.00 sec)
So what about Innodb ? Innodb executed the same query in 2.9 seconds which was a bit disappointing for me as I expected MyISAM to be slower due to amount of extra system calls it has to read row data from OS Cache.
So what if we convert i and j columns to varchar while sticking to utf8 character set which we had as default ? Joining on Char columns completes in 4.5 seconds on Innodb which is about 50% slower compared to Joining on Int, for MyISAM however the time became 11.0 seconds which is over 6 times slower than joining on the integer.
In fact this was expected as MyISAM uses key compression for varchar columns so random key lookups become significantly slower. I tried to set pack_keys=0 which typically helps in similar cases but it looks like there is regression bug and this setting does not work any more.
The next test I decided to do is to convert Innodb table to latin1 character set. I was expected this to shorten some internal buffers MySQL has to allocate for key comparison as well as have comparison function significantly faster. The reason for this test was – latin1 encoding is enough for most character based keys – uuid, sha1/md5 based etc.
Latin1 encoding indeed gave significant improvement to 3.5 seconds which is just 20% slower than integer based join for Innodb.
Finally I decided to check if using longer strings slows down things significantly and so I replaced all numbers with their sha1() hashes which still made eq join to run the same ways but gave me much longer keys. The performance dropped down to 6.1 seconds which makes it over 2 times slower compared to integer based join.
So how do I read these results ?
- CHAR keys are indeed slower for joins compared to integer keys
- Performance degradation can range from few percent to couple of times for Innodb tables
- MyISAM Tables may suffer significantly if key compression is not disabled
- Joining on Shorter CHAR keys is significantly faster than Long keys
- Latin1 (I guess any simple encoding) is significantly faster for joins compared to UTF8
These tests were preformed for in memory tables, for IO bound workload results are likely to be different as longer indexes expected to have much worse cache fit, especially if you keep them unpacked.