Brian recently posted an article comparing UUID and auto_increment primary keys, basically advocating the use of UUIDs instead of auto_increment keys. I wanted to clarify this a bit, as I’ve seen UUID keys cause problems in so many cases.
First let’s look at the benchmark. The full schema is not specified in the article itself, so the results are not absolutely clear, but we can already draw certain conclusions.
Data size is very small. What is the biggest problem with UUIDs? It is the fact that inserts go to random locations in the index tree, which requires a lot of IO once the index tree does not fit in memory. This is not simply a matter of 32 bytes vs 4 bytes for the key value: if you used an integer key and inserted data in random order you would have the same problem.
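To make the random-insert effect concrete, here is a minimal simulation sketch. It is a toy model, not a model of any real storage engine: it assumes a fixed number of index pages, an LRU page cache and uniformly random key positions. The names `count_page_misses`, `TOTAL_PAGES`, `CACHE_PAGES` and all sizes are hypothetical choices for illustration.

```python
import random
from collections import OrderedDict


def count_page_misses(page_ids, cache_pages):
    """Count how many page accesses miss an LRU cache of `cache_pages` pages."""
    cache = OrderedDict()
    misses = 0
    for page in page_ids:
        if page in cache:
            cache.move_to_end(page)        # mark page as most recently used
        else:
            misses += 1                    # would be a disk read in real life
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used page
    return misses


TOTAL_PAGES = 10_000    # assumed index size, in pages
CACHE_PAGES = 1_000     # assumed cache size: 10% of the index
INSERTS = 100_000
ROWS_PER_PAGE = INSERTS // TOTAL_PAGES

# Growing key: consecutive inserts land on the same (last) page.
sequential = [i // ROWS_PER_PAGE for i in range(INSERTS)]

# Random key: every insert lands on an arbitrary page.
rng = random.Random(42)
scattered = [rng.randrange(TOTAL_PAGES) for _ in range(INSERTS)]

seq_misses = count_page_misses(sequential, CACHE_PAGES)
rand_misses = count_page_misses(scattered, CACHE_PAGES)
print("sequential misses:", seq_misses)
print("random misses:    ", rand_misses)
```

In this toy model the sequential key misses once per page, while the random key misses on most inserts as soon as the cache is much smaller than the index.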
In fact, if you store the UUID in binary form you can bring it down to 16 bytes, so size is not really the problem.
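As a quick illustration of the sizes involved, the sketch below uses Python's standard `uuid` module to show one UUID in its 36-character text form, its 32-character hex form and its 16-byte binary form. The UUID value itself is an arbitrary example, not one taken from the benchmark.

```python
import uuid

# Arbitrary example value, chosen only for illustration.
u = uuid.UUID("6ccd780c-baba-1026-9564-0040f4311e29")

text_form = str(u)      # with dashes, as it would be stored in char(36)
hex_form = u.hex        # dashes stripped
binary_form = u.bytes   # raw bytes, as it would be stored in binary(16)

print(len(text_form), len(hex_form), len(binary_form))

# The binary form is lossless: the original UUID can be rebuilt from it.
restored = uuid.UUID(bytes=binary_form)
```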
What about Secondary Keys? For Innodb tables, UUIDs have an extremely poor impact on your secondary keys, because every secondary index entry refers to its row by primary key value. This especially hurts when you have a lot of short integer keys, because their index entries can become many times longer.
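A rough back-of-the-envelope sketch of the secondary key growth follows. It counts only the indexed value plus the primary key value stored with it and ignores all per-record and per-page overhead, and it assumes char(36) uses a single-byte character set. The helper name `secondary_entry_bytes` is made up for this example.

```python
# Key widths in bytes (char(36) assumed to be a single-byte character set).
INT_KEY = 4
UUID_CHAR36 = 36
UUID_BINARY16 = 16


def secondary_entry_bytes(secondary_key_bytes, primary_key_bytes):
    """Payload of one secondary index entry: indexed value + primary key value."""
    return secondary_key_bytes + primary_key_bytes


with_int_pk = secondary_entry_bytes(INT_KEY, INT_KEY)
with_char_uuid_pk = secondary_entry_bytes(INT_KEY, UUID_CHAR36)
with_binary_uuid_pk = secondary_entry_bytes(INT_KEY, UUID_BINARY16)

print("int secondary key, int PK:        ", with_int_pk, "bytes")
print("int secondary key, char(36) PK:   ", with_char_uuid_pk, "bytes")
print("int secondary key, binary(16) PK: ", with_binary_uuid_pk, "bytes")
```

Under these assumptions an index on a 4-byte integer column carries 8 bytes per entry with an integer primary key, 40 bytes with a char(36) UUID and 20 bytes with a binary UUID.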
Parallel Inserts: UUIDs are often advertised as a way to spread the load away from the single buffer page which auto_increment is constantly hitting. This is true, but at large data sizes the benefit is far outweighed by how poorly the BTREE is buffered. The most efficient approach here is not auto_increment but a set of partitioned sequences. For example, you can have 256 growing sequences (with the high byte used for the sequence number), which gives you 256 “hotspots”: enough to make the load parallel but still few enough to be well cached. It is also worth noting that Innodb (the storage engine usually considered best for parallel inserts/updates) has pretty much table-level locks when it comes to auto_increment columns. That, however, is a completely separate problem, which can be fixed in MySQL 5.1.
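The partitioned-sequence idea can be sketched roughly as follows. This is only an illustration under assumptions of my own: a 32-bit key whose high byte is the sequence number and whose low 24 bits are a per-sequence counter. The class name `PartitionedSequence` is hypothetical, the counters live in memory, and the code is not thread-safe; a real application would need to persist and synchronize the counters.

```python
class PartitionedSequence:
    """256 independently growing sequences packed into one 32-bit key.

    Assumed layout: high byte = sequence (partition) number,
    low 24 bits = counter within that sequence.
    """

    PARTITIONS = 256
    COUNTER_BITS = 24

    def __init__(self):
        self._counters = [0] * self.PARTITIONS

    def next_id(self, partition):
        if not 0 <= partition < self.PARTITIONS:
            raise ValueError("partition must be in 0..255")
        counter = self._counters[partition] + 1
        if counter >= 1 << self.COUNTER_BITS:
            raise OverflowError("sequence exhausted for this partition")
        self._counters[partition] = counter
        # Each partition's keys grow monotonically inside their own key range.
        return (partition << self.COUNTER_BITS) | counter


seq = PartitionedSequence()
first = seq.next_id(0)
second = seq.next_id(0)
other = seq.next_id(1)
print(first, second, other)
```

Each writer, for example one per connection, would pick a partition so that inserts hit one of 256 separate "last pages" rather than a single one or a random one.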
Data Clustering: This again is an Innodb-specific aspect of primary key selection. You can often gain a lot by selecting a primary key which clusters data in a way that matches your application’s needs. It may be (user_id,msg_id) in some cases, but in many cases auto_increment already gives us what we want: if we access “recent” items more frequently and the data set is large, auto_increment will work much better than UUID, because UUID will have recent data scattered all across the table.
Lookups: Later in the comments Brian mentions that the point of the benchmark was rather the speed of lookups by primary key. Lookup speed can be similar or can vary a lot. With completely random lookups it should be rather close, both when data is in memory and when data is on disk, with auto_increment having the advantage once the larger UUID BTREE no longer fits in memory and so more disk reads are required. For other access distributions, however, the situation can be different; see the previous point about clustering.
Benchmarks: In general I stand by the same point: be careful applying benchmarks you find on the Internet to your own case. They will be misleading more often than not, especially if you do not have the technical depth to understand all the implications and assumptions. I promised to publish some insert benchmarks for this case with a larger data size, so here they are:
I created MyISAM tables, one containing just an integer auto_increment primary key and another containing just a char(36) value used as a UUID primary key, and populated each with 268,435,456 rows (large enough for that 512M box to be disk bound). For the auto_increment key the load took 1 hour 50 minutes, giving a load speed of 40305 rows/sec. For UUID the load has been running for over 12 hours and is still going. From MySQL status I can see it is loading about 200 rows/sec, and it is still slowing down a bit as the key file grows. So in this little test we have about a 200 times performance difference, which is worth considering.
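To put those rates in perspective, here is a small arithmetic sketch using only the figures quoted above. The extrapolation for UUID assumes the rate stays at a constant 200 rows/sec, which is optimistic since the load was still slowing down; the function name `load_seconds` is just a helper for this example.

```python
ROWS = 268_435_456       # rows loaded into each table
AUTO_INC_RATE = 40305    # rows/sec reported for the auto_increment load
UUID_RATE = 200          # approximate rows/sec observed for the UUID load


def load_seconds(rows, rows_per_sec):
    """Time to load `rows` at a constant rate."""
    return rows / rows_per_sec


auto_inc_hours = load_seconds(ROWS, AUTO_INC_RATE) / 3600
uuid_days = load_seconds(ROWS, UUID_RATE) / 86400
slowdown = AUTO_INC_RATE / UUID_RATE

print(f"auto_increment load: {auto_inc_hours:.2f} hours")
print(f"UUID load at a constant rate: {uuid_days:.1f} days")
print(f"rate ratio: {slowdown:.0f}x")
```

At a constant 200 rows/sec the full UUID load would need roughly 15 days rather than under two hours.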
Can UUIDs be handled efficiently? They will not be as efficient as ints, simply because size matters, but you can make them pretty efficient by packing them into binary form. They should also be stored in an optimized sort order to minimize how random they are: there is a timestamp-based part in the UUID which has properties similar to auto_increment, and which can be used to keep values generated at the same point in time physically local in the BTREE index.
This is actually the reason why UUIDs can do much better than SHA1, which Kevin proposes in the same post: hashes, whether SHA1, MD5 or CRC32, have no such time-ordered part, so their values land at random places in the index.
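One possible way to get such a sort order is sketched below, assuming time-based (version 1) UUIDs. In the standard layout the fastest-changing part of the timestamp (time_low) comes first, so the sketch moves the high timestamp bytes to the front of the 16-byte binary value. The function names `time_ordered_bytes` and `make_v1` are hypothetical, and `make_v1` exists only to build UUIDs with known timestamps for the demonstration.

```python
import uuid


def time_ordered_bytes(u):
    """Rearrange a version 1 UUID so its bytes sort by generation time.

    Standard layout: time_low[0:4] time_mid[4:6] time_hi_and_version[6:8]
    clock_seq + node[8:16]. The result puts the timestamp parts in
    most-significant-first order.
    """
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]


def make_v1(timestamp, node=0x0040F4311E29):
    """Build a version 1 UUID from a 60-bit timestamp (demo helper only)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


# Two UUIDs generated one timestamp tick apart, across a time_low rollover.
earlier = make_v1(0xFFFFFFFF)
later = make_v1(0x100000000)

# Raw byte order puts the later UUID first, the rearranged order does not.
raw_order_ok = earlier.bytes < later.bytes
rearranged_order_ok = time_ordered_bytes(earlier) < time_ordered_bytes(later)
print(raw_order_ok, rearranged_order_ok)
```

With the rearranged bytes as the key, values generated close together in time end up close together in the index, much like auto_increment values do.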