Picking datatype for STATUS fieldsPeter Zaitsev
Quite commonly in the applications you would need to use some kind of “status” field – status of order – “new”, “confirmed”, “in production”, “shipped” status of job, message etc. People use variety of ways to handle them often without giving enough thought to the choice which can cause problems later.
Perhaps worst, though quite common thing is to define such field as VARCHAR(255) . Even though the stored value is often short the full specified length can be used for internal processing, such as when creating temporary table or sorting.
Before we go to the list of variants which can be used lets talk about two important types of these status fields. First is BOOLEAN or YES/NO type. You would frequently see it as columns VISIBLE, DELETED , CAN_ACCESS etc. In case things are as easy as YES/NO using TINYINT UNSIGNED NOT NULL is quite a good idea, though using ENUM is also fine choice as you’re not planning to change allowed values for such field anyway.
For status fields with more than two values there are commonly following choices:
VARCHAR(N) – This is the most straightforward approach though it is also most inefficient one from performance standpoint. However how much of performance are you going to lose? This depends on number of rows as well as their average length. If you have 1000 of rows you would not save much of space by using TINYINT instead of VARCHAR. Same if you have 1000 bytes per row using 10 byte VARCHAR(10) will not waste too much space. The other problem you may run into is slower JOIN performance especially for MyISAM Tables. Though it is solvable as well.
The benefit of using this solution is obviously self documenting. Values such as “New”, “Shipped”, “Confirmed” are much more descriptive than 1,2,3 and do not need to be separately maintained. However this benefit also comes comes as a drawbacks as it is very easy to store not intended statuses in the system and it is harder to spot them. For example you may have somewhere in the code use “Shiped” value which can cause hard to catch errors. Also if you’re using actual values you may have temptation to display them directly in your application interface which can cause extra problems if you need to support multiple languages. In any case if you’re picking this approach you should make sure the values you use are short. Pick 8, may be 12 characters and use single byte character set to keep the size as short as possible.
ENUM – This is what VARCHAR(N) columns get converted during the process of application optimization. It is very nice as it often allows to save on the storage and get extra performance without application modification. ENUM takes just 1 byte for small sets so it is rather efficient. The problem with ENUM is adding new values using traditional way (ALTER TABLE) causes table rebuild which is not acceptable for many environments. It is possible to hack this around though I would not like to do this as a standard production practice. It is also worth to note ENUM comes with couple of gotchas – first if you store value which is not allowed by ENUM list it will be always converted to ” – empty string, the additional ENUM columns can have. The other gotcha is sort order which is done for ENUM in the order in which values are specified in the ENUM declaration – not according to their string values. So make sure to add values specify strings for ENUM in sorted way to get same sorting as before.
TINYINT – Or you can just use TINYINT which has very small footprint, fast for queries and has no problem with adding values to it, however it has also drawback of integer values not being self documenting and can’t be used in interfaces. TINYINT is a good choice if you can maintain a discipline and have the table which has integer to string mapping
or at least have constants with descriptive names defined, so you do not use numbers in the application directly. I prefer to have the table though because correctly architecting your queries you can use such table with very low overhead while it can become very useful if you plan to extend statuses with further meta data, for example descriptive help message, information when this status is added or flag to mark this is deprecated status value and can’t be used.
In general there are reasons to use all 3 approaches in the applications, just do not do it by randomly picked advice but think what kind of application do you have and what are the most important goals for you. Hope this article will help you to do the right pick.