Storing UUID Values in MySQL

Please note, a more up-to-date follow-up post is here: Storing UUID and Generated Columns

A few years ago Peter Zaitsev, in a post titled “To UUID or not to UUID,” wrote: There is timestamp based part in UUID which has similar properties to auto_increment and which could be used to have values generated at the same point in time physically local in BTREE index.”

For this post, I’ve rearranged the timestamp part of UUID (Universal Unique Identifier) and did some benchmarks.

Many people store UUID as char (36) and use as row identity value (PRIMARY KEY) because it is unique across every table, every database and every server and allow easy merging of records from different databases. But here comes the problem, using it as PRIMARY KEY causes the problems described below.

Problems with UUID

  • UUID has 36 characters which make it bulky.
  • InnoDB stores data in the PRIMARY KEY order and all the secondary keys also contain PRIMARY KEY. So having UUID as PRIMARY KEY makes the index bigger which cannot be fit into the memory
  • Inserts are random and the data is scattered.

Despite the problems with UUID, people still prefer it because it is UNIQUE across every table, can be generated anywhere. In this blog, I will explain how to store UUID in an efficient way by re-arranging timestamp part of UUID.

Structure of UUID

MySQL uses UUID version 1 which is a 128-bit number represented by a utf8 string of five hexadecimal numbers

  • The first three numbers are generated from a timestamp.
  • The fourth number preserves temporal uniqueness in case the timestamp value loses monotonicity (for example, due to daylight saving time).
  • The fifth number is an IEEE 802 node number that provides spatial uniqueness. A random number is substituted if the latter is not available (for example, because the host computer has no Ethernet card, or we do not know how to find the hardware address of an interface on your operating system). In this case, spatial uniqueness cannot be guaranteed. Nevertheless, a collision should have a very low probability.

The timestamp is mapped as follows:

When the timestamp has the (60 bit) hexadecimal value: 1d8eebc58e0a7d7. The following parts of the UUID are set:: 58e0a7d7eebc11d8-9669-0800200c9a66. The 1 before the most significant digits (in 11d8) of the timestamp indicates the UUID version, for time-based UUIDs this is 1.

Fourth and Fifth parts would be mostly constant if it is generated from a single server. First three numbers are based on timestamp, so they will be monotonically increasing. Let’s rearrange the total sequence making the UUID closer to sequential. This makes the inserts and recent data lookup faster. Dashes (‘-‘) make no sense, so let’s remove them.
58e0a7d7-eebc-11d8-9669-0800200c9a66 => 11d8eebc58e0a7d796690800200c9a66

Benchmarking

I created three tables

  • events_uuid – UUID binary(16) PRIMARY KEY
  • events_int – Additional BIGINT auto increment column and made it as primary key and index on UUID column
  • events_uuid_ordered – Rearranged UUID binary(16) as PRIMARY KEY

I created three stored procedures which insert 25K random rows at a time into the respective tables. There are three more stored procedures which call the random insert-stored procedures in a loop and also calculate the time taken to insert 25K rows and data and index size after each loop. Totally I have inserted 25M records.

    • Data Size
      Horizontal Axis – Number of inserts x 25,000
      Vertical Axis – Data Size in MB
      Storing UUID Values
      The data size for UUID table is more than the other two tables.
    • Index Size
      Horizontal axis – Number of inserts x 25,000
      Vertical axis – Index Size in MB
      Index Size
    • Total Size
      Horizontal Axis – Number of inserts x 25,000
      Vertical Axis – Total Size in MB
      Total Size
    • Time taken
      Horizontal axis – Number of inserts x 25,000
      Vertical axis – Time Taken in seconds
      Time Taken

For the table with UUID as PRIMARY KEY, you can notice that as the table grows big, the time taken to insert rows is increasing almost linearly. Whereas for other tables, the time taken is almost constant.

The size of the UUID table is almost 50% bigger than Ordered UUID table and 30% bigger than the table with BIGINT as PRIMARY KEY. Comparing the Ordered UUID table BIGINT table, the time is taken to insert rows and the size are almost the same. But they may vary slightly based on the index structure.

Table Structure

Conclusions for storing UUID Values

    • Create a function to rearrange UUID fields and use it

Inserts

Selects

    • Define UUID as binary(16) as binary does not have any character set

 

References

 

Share this post

Comments (42)

  • Kevin Farley

    Definitely an interesting method Karthik.

    But, I think there’s a more efficient, performant and simple way of accomplishing the solution you’re trying to achieve. InnoDB tables, as you know use clustered indexes in the background, meaning the PRIMARY index is part of the actual row (even when no Primary Key is defined). It also supports a special column called ‘ai_col’ which when created as the first column in any table, set to an auto-incrementing unsigned integer, and flagged as the primary key will enable certain built in performance tweaks relating to the clustered index. All secondary indexes use the clustered index key (which is the important part)

    The benefit of the clustered index is that it saves disk io because the complete page/row is returned rather than just the key which requires a second read operation on the specified page to retrieve the actual data.

    In your solution the primary key would be a 16byte binary UUID, and all secondary indexes would be forced to use this meaning much larger index footprints, and I don’t know about you but most of our tables have multiple indexes. You’ve eliminated the IO overhead of inserting UUID’s and re-indexing, but you’ve increased the size of every index in the system as a side effect meaning both more physical and memory footprint.

    By using the ‘ai_col’ first column and using it as the primary key, it solves both of those problems. You would then just create a second column with the true UUID’s, add a UNIQUE index on it. Your rows will have a slightly larger size, but your indexes will be MUCH smaller, resulting in less IO, and a smaller footprint overall.

    You gain the benefits of an ordered insert key and thus low IO, you also gain the benefit of distributed UUID generation across all your app servers without any worry of collision in a cluster / multi-master environment, and you get to stick with standard UUID formatting which keeps things simple.

    I completely agree on the hyphens though, useless 🙂

    Kevin.

    December 19, 2014 at 2:48 pm
  • Steven Barre

    I’ve been using this version. Any comments on its rearranging compared to yours?

    http://stackoverflow.com/a/7168916

    December 19, 2014 at 3:07 pm