Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Enum Fields VS Varchar VS Int + Joined table: What is Faster?

January 24, 2008

Author

Alexey Kovyrin

Benchmarks

Share this Post:

Really often in customers’ application we can see a huge tables with varchar/char fields, with small sets of possible values. These are “state”, “gender”, “status”, “weapon_type”, etc, etc. Frequently we suggest to change such fields to use ENUM column type, but is it really necessary (from performance standpoint)? In this post I’d like to present a small benchmark which shows MySQL performance when you use 3 different approaches: ENUM, VARCHAR and tinyint (+joined table) columns.

In practice you can also often use 4th variant which is not comparable directly, which is using integer value and having value mapping done on application level.

So, first of all, a few words about our data set we’ve used for this benchmark. We have 4 tables:
1) Table with ENUM:

CREATE TABLE cities_enum (
  id int(10) unsigned NOT NULL auto_increment,
  state enum('Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','District of Columbia','Florida','Georgia','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri') NOT NULL,
  city varchar(255) NOT NULL,
  PRIMARY KEY  (id),
  KEY state (state)
) ENGINE=MyISAM;

CREATE TABLE cities_enum (

id int(10) unsigned NOT NULL auto_increment,

state enum('Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','District of Columbia','Florida','Georgia','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri') NOT NULL,

city varchar(255) NOT NULL,

PRIMARY KEY (id),

KEY state (state)

) ENGINE=MyISAM;

2) Table with VARCHAR:

CREATE TABLE cities_varchar (
  id int(10) unsigned NOT NULL auto_increment,
  state varchar(50) NOT NULL,
  city varchar(255) NOT NULL,
  PRIMARY KEY  (id),
  KEY state (state)
) ENGINE=MyISAM;

CREATE TABLE cities_varchar (

id int(10) unsigned NOT NULL auto_increment,

state varchar(50) NOT NULL,

city varchar(255) NOT NULL,

PRIMARY KEY (id),

KEY state (state)

) ENGINE=MyISAM;

3) Table with INT:

CREATE TABLE cities_join (
  id int(10) unsigned NOT NULL auto_increment,
  state_id tinyint(3) unsigned NOT NULL,
  city varchar(255) NOT NULL,
  PRIMARY KEY  (id),
  KEY state_id (state_id)
) ENGINE=MyISAM;

CREATE TABLE cities_join (

id int(10) unsigned NOT NULL auto_increment,

state_id tinyint(3) unsigned NOT NULL,

city varchar(255) NOT NULL,

PRIMARY KEY (id),

KEY state_id (state_id)

) ENGINE=MyISAM;

4) Dictionary table for cities_join:

CREATE TABLE IF NOT EXISTS `states` (
  `id` tinyint(3) NOT NULL auto_increment,
  `name` char(40) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `name` (`name`)
) ENGINE=MyISAM;

CREATE TABLE IF NOT EXISTS `states` (

`id` tinyint(3) NOT NULL auto_increment,

`name` char(40) NOT NULL,

PRIMARY KEY (`id`),

UNIQUE KEY `name` (`name`)

) ENGINE=MyISAM;

All cities_* tables have 1,5M records each and records are distributed among 29 different states (just happens to be data we had available for tests)

Two important notes about this table before we get to results – this is rather small table which fits in memory in all cases (and dictionary table does too). Second – the rows are relatively short in this table so changing state from VARCHAR to ENUM or TINYINT affects row size significantly. In many cases size difference will be significantly less.

All tests are runned 1000 times and the result time is average from those 1000 runs.

So, our first benchmark is simple: we need to get 5 names of cities, located in Minnesota and, to make things slower, we’ll take those records starting from record #10000 making MySQL to discard first 10000 records.

1) Results for ENUM:

select SQL_NO_CACHE city from cities_enum WHERE state='Minnesota' limit 10000,5;
Result time(mean): 0.082196

1 2	select SQL_NO_CACHE city from cities_enum WHERE state='Minnesota' limit 10000,5; Result time(mean): 0.082196

2) Results for VARCHAR:

select SQL_NO_CACHE city from cities_varchar WHERE state='Minnesota' limit 10000,5;
Result time(mean): 0.085637

1 2	select SQL_NO_CACHE city from cities_varchar WHERE state='Minnesota' limit 10000,5; Result time(mean): 0.085637

3) Results for INT + join:

select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) WHERE s.name='Minnesota' limit 10000,5;
Result time(mean): 0.083277

1 2	select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) WHERE s.name='Minnesota' limit 10000,5; Result time(mean): 0.083277

So, as you can see, all three approaches are close with ENUM being fastest and VARCHAR few percent slower.

This may look counterintuitive because table is significantly smaller with ENUM or TINYINT but in fact it is quite expected – This is MyISAM table which is accessed via index, which means to retrieve each row MySQL will have to perform OS system call to read the row, at this point there is not much difference if 20 or 30 bytes are being read. For Full Table Scan operation difference often would be larger.

It is also interesting to note performance of Innodb tables in this case: for VARCHAR it takes about 0.022 per query which makes it about 4 times faster than for MyISAM. This is great example of the case when Innodb is much faster than MyISAM for Read load.

The other surprise could be almost zero cost of the join, which we always claimed to be quite expensive. Indeed there is no cost of the join in this case because there is really no join:

mysql> EXPLAIN select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) WHERE s.name='Minnesota' limit 10000,5 G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: const
possible_keys: PRIMARY,name
          key: name
      key_len: 40
          ref: const
         rows: 1
        Extra: 
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ref
possible_keys: state
          key: state
      key_len: 1
          ref: const
         rows: 225690
        Extra: 
2 rows in set (0.10 sec)

mysql> EXPLAIN select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) WHERE s.name='Minnesota' limit 10000,5 G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: s

type: const

possible_keys: PRIMARY,name

key: name

key_len: 40

ref: const

rows: 1

Extra:

*************************** 2. row ***************************

id: 1

select_type: SIMPLE

table: c

type: ref

possible_keys: state

key: state

key_len: 1

ref: const

rows: 225690

Extra:

2 rows in set (0.10 sec)

Because we refer state by name, which is unique,it is pre-read and query executed basically on single table querying state by ID.

Next test was a result of my curiosity. I’ve tried to order results by states.

1) Results for ENUM:

select SQL_NO_CACHE city from cities_enum ORDER BY state limit 10000, 5;
Result time(mean): 0.077549

1 2	select SQL_NO_CACHE city from cities_enum ORDER BY state limit 10000, 5; Result time(mean): 0.077549

2) Results for VARCHAR:

select SQL_NO_CACHE city from cities_varchar ORDER BY state limit 10000, 5;
Result time(mean): 0.0854793

1 2	select SQL_NO_CACHE city from cities_varchar ORDER BY state limit 10000, 5; Result time(mean): 0.0854793

select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) ORDER BY s.name limit 10000,5;
Result time(mean): 26.0854793

1 2	select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) ORDER BY s.name limit 10000,5; Result time(mean): 26.0854793

As you can see, ENUM and VARCHAR show close performance, while join performance degraded dramatically.

Here is why:

mysql> explain select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) ORDER BY s.name limit 10000,5G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ALL
possible_keys: state
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1439943
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 1
          ref: test.c.state_id
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

mysql> explain select SQL_NO_CACHE c.city from cities_join c JOIN states s ON (s.id = c.state_id) ORDER BY s.name limit 10000,5G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: c

type: ALL

possible_keys: state

key: NULL

key_len: NULL

ref: NULL

rows: 1439943

Extra: Using temporary; Using filesort

*************************** 2. row ***************************

id: 1

select_type: SIMPLE

table: s

type: eq_ref

possible_keys: PRIMARY

key: PRIMARY

key_len: 1

ref: test.c.state_id

rows: 1

Extra:

2 rows in set (0.00 sec)

Because we’re sorting by name we have to perform the join for each row to retrieve it. This also means sort can’t be done by index and extra sort pass (filesort) is required, which also makes MySQL to store Join result in temporary table to do the sort, all together makes things quite miserable. Note this might not be best execution plan to pick in this case but this is other story.

To avoid part of this problem we of course arrange state ids in the alphabetical order and do sort by state_id, though join cost still could be significant.

And the last test – selecting city and name in arbitrary order, skipping first 10000 rows to make query times longer.

1) Results for ENUM:

select SQL_NO_CACHE city, state from cities_enum limit 10000, 5;
Result time(mean): 0.003125

1 2	select SQL_NO_CACHE city, state from cities_enum limit 10000, 5; Result time(mean): 0.003125

2) Results for VARCHAR:

select SQL_NO_CACHE city, state from cities_varchar limit 10000, 5;
Result time(mean): 0.003283

1 2	select SQL_NO_CACHE city, state from cities_varchar limit 10000, 5; Result time(mean): 0.003283

select SQL_NO_CACHE c.city, s.name from cities_join c JOIN states s ON (s.id = c.state_id) limit 10000,5;
Result time(mean): 0.004170

1 2	select SQL_NO_CACHE c.city, s.name from cities_join c JOIN states s ON (s.id = c.state_id) limit 10000,5; Result time(mean): 0.004170

As you can see, ENUM and VARCHAR results are almost the same, but join query performance is 30% lower.
Also note the times themselves – traversing about same amount of rows full table scan performs about 25 times better than accessing rows via index (for the case when data fits in memory!)

So, if you have an application and you need to have some table field with a small set of possible values, I’d still suggest you to use ENUM, but now we can see that performance hit may not be as large as you expect. Though again a lot depends on your data and queries.

0 0 votes

Article Rating

30 Comments

Oldest

Newest Most Voted

Roland Bouman

18 years ago

Thanks! I did expect the enum to be fastest for all results, but it is amazing to see how little difference there sometimes is between the varchar and enum. Now I’m wondering, what if we’d be using CHARs instead of VARCHAR? For example for a set of language or country codes…I’d expect CHAR to be faster than VARCHAR for these small strings, but can you say whether it would be even noticable?

Jeremy Cole

18 years ago

Hi Vadim,

I would like to see these tests run with a real load generation system, e.g. rather than just a loop of 1000 queries. Maybe I will run them through our test system. 🙂 I could come up with some of my own data for the test, but to make things totally fair and comparable, could you post your data sets?

Regards,

Jeremy

Jeremy Cole

18 years ago

Sorry, I meant Alexey, not Vadim. Sorry Alexey. 🙂

Alexey Kovyrin

18 years ago

Jeremy,

I would love to see real system’s result! I’ve tried to create as “real” dataset as possible, but of course, on real tables with more columns, etc it would be even more interesting.

Alexey Kovyrin

18 years ago

Jeremy,

I’ll dump my data now and upload it somewhere soon.

Jay Pipes

18 years ago

Nice results. One thing to keep in mind that the criteria for using ENUM should be small and *static* lookups. Trying to alter the ENUM values on large tables requires a table rebuild, versus a lookup table needs just an INSERT into the lookup table…

Alexey Kovyrin

18 years ago

2Jeremy: http://blog.kovyrin.net/files/mpb/test.sql.bz2

Admin

Peter Zaitsev

18 years ago

Jeremy,

In this case the benchmark was in memory anyway and this is micro benchmark to evaluate particular case. In real life there could be a lot of extra implications, for example getting VARCHAR updated may fragment a row causing quite nasty effects for IO bound workload.

Jay,

Thanks you’re right if you need to change number of values in ENUM it is nightmare. Sometimes you can hack around by avoiding costly ALTER TABLE (just replacing frm) but this is dangerous and not supported 🙂

Jurriaan Persyn

18 years ago

Great to see some discussion and test results on this matter! In our application, we’re mostly using ENUM’s for these kind of problems (mostly because table size really is an issue).
In this comment I want to illustrate why we’re sometimes using the 4th solution you briefly mention; i.e. using a TINYINT and doing the mapping on an application level.
The problem we had with ENUM-fields; When we wanted to add a ‘type’ to an ENUM-field (or change the definition of the field in any way), MySQL often did a complete rewrite of that table. For some of our bigger tables (think tens of millions of records), this could take up to a few hours sometimes, which meant downtime. So for a few cases we decided to switch from ENUM’s to TINYINT’s because this eliminates the need for a change in table definition.

mostafa

18 years ago

but if we make mistake in updating(include add,change,del) a field of enum type may be we lost data and it is awful

Admin

Peter Zaitsev

18 years ago

Mostafa – only if you screw up badly.

1) you need to have backups of course.
2) you better to preserve old .frm file and if you change is not successful.
3) you better test it outside of production to make sure such change works.

But you’re right it is a bit dangerous 🙂

Greg Knapp

18 years ago

Not sure if this will have any impact on queries performance but your column definitions for state_id differ slightly between your dictionary table and cities_join table; cities_join.state_id is UNSIGNED and states.id isn’t.

Kipp

18 years ago

Does anyone know if/how these vary using sets. Maybe comparing set v. varchar list v. int used bitwise?

*confused*

18 years ago

16 bit fixed length integers vs random length strings (32 to 400 bits)

Shouldn’t the enum be several orders of magnitude faster, especially on sorts? I don’t understand.
Please explain to me why the benchmark shows only negligible differences.

Raymond

18 years ago

Thank you for these tests, it helped us find a solution to several issues before implementing.

Hakan

18 years ago

Thanks for the test.

I think using the 4th way depends on the application. On our case, we store this kind of information in TINYINT and do the work in application, as we only select 25 of them at once. But when selecting thousands of row, and making change in all of them may not be better than storing in ENUM or varchar. I’ll do some tests about this when I have time and post here.

somebody

17 years ago

have you guys tried indexing state_id on the cities_join table.
though I read something about frequent inserts with indexes being slow.
but would indexing speed things up?

and also

SELECT SQL_NO_CACHE c.city FROM states s LEFT JOIN cities_join c ON (s.id = c.state_id) WHERE s.name=’Minnesota’ LIMIT 10000,5;

since your searching by state maybe if you place it at the left mysql would search for that first and then search for the cities with similar state_id, and it will be ordered automatically.

ENUM is what got me to this blog, but since you guys said its dangerous, it’s no longer in my options.

praca

17 years ago

I use tinyint(1) to present value 0 or 1.
I thing enum, boolean and others methods works slower.

Farrukh

16 years ago

In the test you used int+join table, what if we move the join table functionality to our application and just compare ENUM VS CHAR VS INT.

Tadas Sasnauskas

16 years ago

VARCHAR is nightmare when database/code is not documented. In case of enum you at least know possible value set.

Praveena

15 years ago

What is the maximum recommended length of a ENUM?

Wil Moore III

14 years ago

‘@Praveena,

The length of the ENUM value doesn’t matter much (unless you are approaching the table definition limit)…the ENUM “index” is what is stored.

In other words, if you were to insert a value into an ENUM column, but instead of the actual value, insert the index (for instance, 0 for the first item, or 1 for the second item), you would notice that once you’ve viewed the table, you would see the expected value listed.

This is great for performance; however, it can lead to some very subtle bugs if you aren’t aware of what is actually going on.

Wil Moore III

14 years ago

Sorry,

Actually, the indexes start @ 1 (0 signifies there was a data validation error). How is that for tricky 🙂

MattK

13 years ago

I believe that you second benchmark is not quite equivalent.

When sorting the enum you are sorting by the enumeration, not the value. This would be equivalent to sorting the INT + join by the state_id alone. It just so happens that your enumeration of states is in order. If we can assume this for the ENUM, why can’t we assume this for the referenced state table?

Try sorting the enum by states in order if the enumeration itself is not in order. 🙂

yon

13 years ago

after a deep benchmarking , found using int and joint table the best case for my project!

Thomas

13 years ago

‘@Will Moore III,

That caused a few bugs for us. One of my predecessors thought it wise to use ENUM(‘0′,’1’) instead of TINYINT(1) for storing boolean values.

SET foobar = ‘1’ would come out truthy, SET foobar = 1 would come out falsy. Brilliant!

tony

11 years ago

I always experienced lots of problems using MySQL ENUMs , and never going to use it again.

paul

11 years ago

I remember the project which we used ENUMs for a field of big table, which got about 4 billion rows. finally we where forced to add an extra possible value to that field and we were involved handling it for a whole week!

Ben Ellis

10 years ago

I’ve only recently started using mySQL from previously using MSSQL so my interpretation of how mySQL works may be wrong.

Your test results don’t surprise me as you mentioned the table data would easily fit into memory and I’m assuming there wasn’t much else going on in the database when you ran the tests.

I think you’ll find the biggest difference in performance is when there is a lot of paging going on as more records/indexes will fit into a page if you use enum or int data types than if you use varchar datatype, therefore there is less disk IO and better performance.

You could possibly emulate this by reducing the amount of memory used by mySQL so it has to be more clever with it’s memory management, or, have several bigger SQL queries running at the same time on other large tables in the database to the point the memory is filled with data from other tables.

You might also consider evaluating insert and update performance as well as I think there will be small gains in these areas (tiny bit more writing to the transaction log).

Milos

9 years ago

Is there any chance that anyone have data-set used in this article? Would really like to test some things with it, and generating new one would take a lot of time.