November 28, 2014

Using CHAR keys for joins, how much is the overhead ?

I prefer to use Integers for joins whenever possible and today I worked with client which used character keys, in my opinion without a big need. I told them this is suboptimal but was challenged with rightful question about the difference. I did not know so I decided to benchmark.

The results below are for MySQL 5.1.18 using MyISAM and Innodb tables. This time unlike other benchmarks I decided to do Join not on primary key and have query to read data for both tables. If the query would be index covering I would expect us to see different ratio. The query I use here is constructed to stress out join code while avoid sending data to the client Do not try to find any good meaning for query or schema. For joins which fetch just few rows difference is likely to be less as the join code itself is likely to be responsible for less portion of response time.

OK. Lets start with first simple MyISAM table and join query performed on INT fields:

So what about Innodb ? Innodb executed the same query in 2.9 seconds which was a bit disappointing for me as I expected MyISAM to be slower due to amount of extra system calls it has to read row data from OS Cache.

So what if we convert i and j columns to varchar while sticking to utf8 character set which we had as default ? Joining on Char columns completes in 4.5 seconds on Innodb which is about 50% slower compared to Joining on Int, for MyISAM however the time became 11.0 seconds which is over 6 times slower than joining on the integer.

In fact this was expected as MyISAM uses key compression for varchar columns so random key lookups become significantly slower. I tried to set pack_keys=0 which typically helps in similar cases but it looks like there is regression bug and this setting does not work any more.

The next test I decided to do is to convert Innodb table to latin1 character set. I was expected this to shorten some internal buffers MySQL has to allocate for key comparison as well as have comparison function significantly faster. The reason for this test was – latin1 encoding is enough for most character based keys – uuid, sha1/md5 based etc.

Latin1 encoding indeed gave significant improvement to 3.5 seconds which is just 20% slower than integer based join for Innodb.

Finally I decided to check if using longer strings slows down things significantly and so I replaced all numbers with their sha1() hashes which still made eq join to run the same ways but gave me much longer keys. The performance dropped down to 6.1 seconds which makes it over 2 times slower compared to integer based join.

So how do I read these results ?

  1. CHAR keys are indeed slower for joins compared to integer keys
  2. Performance degradation can range from few percent to couple of times for Innodb tables
  3. MyISAM Tables may suffer significantly if key compression is not disabled
  4. Joining on Shorter CHAR keys is significantly faster than Long keys
  5. Latin1 (I guess any simple encoding) is significantly faster for joins compared to UTF8

These tests were preformed for in memory tables, for IO bound workload results are likely to be different as longer indexes expected to have much worse cache fit, especially if you keep them unpacked.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Sergei Golubchik says:

    Did you also try latin1 on MyISAM tables ? It’s interesting to see the difference between utf8 and latin1 for MyISAM. And when you say “latin1″ what collation do you mean ? Not latin1_german2_ci (which doesn’t allow byte-wise comparison), I hope ?

  2. peter says:

    Sergey,

    I did not test that because I could not disable packed keys for MyISAM and I assumed this would be majority of the overhead. Did the test right now and we get MyISAM with latin1 encoding completing in 9 seconds. Better than 11 seconds but still very slow.

    I used default collation for latin1 and utf8 character sets.

  3. viktors says:

    Good post! It would be interesting how BIGINT keys compare to INTs – often tables using BIGINTs are the same tables which require the most performance. Also, when comparing SHA1 columns, did you try BINARY/VARBINARY to prevent case-insensitive comparison (which could lower performance)?

  4. peter says:

    Viktor,

    Thank you for ideas. In fact results are very interesting.

    First about VARBINARY. I compared VARBINARY to using LATIN1 encoding for “normal” join with short columns containing integers as a strings and LATIN1 was even a bit faster for unknown reason. Latin1 averaged 3.51 seconds while VARBINARY 3.56 seconds. The difference is so close I’d call it almost the same but still it is not faster.

    Now BIGINT was complete surprise for me and someone in MySQL may want to profile it in more details. Changing column types from INT to BIGINT slowed down join speed for MyISAM tables from 1.7 seconds to 3.8 seconds, which is slower than Innodb with chars. Note the tests are done on x86_64 platform so 64bit should not cause so much slow down.

    Innodb also is a bit slower with BIGINT at 3.2 seconds compared to 2.9 seconds with INT columns while this is more understandable as row size is increased significantly in this test case with short rows.

    Now for SHA1 based joins I also tested few storage options. First one is LATIN1 which improved join speed from 6.1 to 4.4 seconds for Innodb tables. Using VARBINARY instead gave about same performance.

    Yet another trick I tried is using “compressed” sha1 storage is by running unhex() on the sha1 data. This is done with VARBINARY type of course. This improves performance from 4.4 sec to 3.9 sec for Innodb tables which is quite decent gain.

    The last test I’ve done is changing varchar(10) to varchar(255) in my “short” varchar join tests for Innodb table with utf8 encoding. This caused some degradation slowing down join speed from 4.5 seconds to about 4.7 seconds

  5. vinjvinj says:

    Did you try using char instead of varchar? Most guid used in the db are always of fixed length.

  6. “Changing column types from INT to BIGINT slowed down join speed for MyISAM tables from 1.7 seconds to 3.8 seconds”

    How much data? Could it have been just reading them off disk since bigints are twice as large?

    I’m going to have to play with this since we use latin1 SHA1 varchars for PKs.

  7. peter says:

    Kevin,

    This is CPU bound completely in memory workload. The table had some 250.000 rows as you can see in explain taking 20MB or so.

  8. peter says:

    I tried running VARCHAR vs CHAR for sha1 ids

    The join took 3.90 seconds for CHAR(40) and 4.05 seconds for VARCHAR(40) using latin1 encoding Innodb tables. So there is indeed benefits from using CHAR type for this type of column while it is not significant.

  9. Pajz says:

    Hi Peter may be your computer has low memory.
    This comment is base in My SQL documenration try again to retreive the data..

    Before MySQL 5.0.3, a CHAR or VARCHAR column with a length specification greater than 255 is converted to the smallest TEXT type that can hold values of the given length. For example, VARCHAR(500) is converted to TEXT, and VARCHAR(200000) is converted to MEDIUMTEXT. Similar conversions occur for BINARY and VARBINARY, except that they are converted to a BLOB type

  10. peter says:

    Pajz,

    I’m not sure what is the point of this documentation quote.
    But regarding memory – there is enough memory to fit all database in all cases so disk IO is not an issue here.

  11. Alexey says:

    Is VARBINARY same as VARCHAR/binary?

  12. Rob says:

    any Mysql Podcast other than She BA Podcast that are good?
    sorry this is off topic I know….

    Cheers Rob

  13. rafael says:

    how to convert MYSQL ENUM (char type) into ENUM (int type)

  14. Rakesh says:

    I believe this is true from across the database environments; be it MS-SQL or MySQL or ORACLE or DB2 or any other.

  15. Michael Evans says:

    I know it’s been a while since this was posted, but I don’t believe the comparison was done fairly. An int(10) is only 4 bytes long, whereas char(10) is truly 10 bytes. Just from the size difference alone I would except char(10) to be slower because at the very least you’re comparing 2.5 times as many bytes–just like switching from int to bigint slowed things down because bigint is 8 bytes.

    I wonder how a char(4) would compare to an int, since at least then you’re comparing the same number of bytes.

  16. Michael,

    The comparison was done for specific case – people use CHAR (by mistake) to store numeric data. Because 32bit int goes up to 4 billion it requires 10 digits to store hence CHAR(10). I would welcome you to try your own benchmarks with short char size and share results so we can see how much of the difference is due to type and how much due to length.

Speak Your Mind

*