November 27, 2014

INSERT ON DUPLICATE KEY UPDATE and summary counters.

INSERT … ON DUPLICATE KEY UPDATE is very powerful but often forgotten MySQL feature. It was introduced in MySQL 4.1 but I still constantly see people unaware of it.

Myself I like this feature big deal because it is designed in truly MySQL style – very efficient solution for freqent task while keeping it beautiful and easy to use.

So what is this feature great for ? Well any kind of maintaining counters. If you’re writing traffic accounting it could be traffic and number of packet passed for given port or IP address. For Web Applications it could be counting number of visits per page or IP address, number of times particular keyword was searched etc.

This functionality also makes it very easy to do incremental single pass log file processing and building summary tables.

Here is example:

This example actually shows one more neat feature of MySQL – inet_aton and inet_ntoa functions which can convert IP address strings to integers and back. This allows to save on field length significantly by using 4 bytes instead of 15

The third feature this example takes advantage of is TIMESTAMP field. By default first TIMESTAMP column will have its value automatically updated to current timestamp on insert and update. We actually could have omitted now() in insert clause but this would require to specify list of columns which we skipped for sake of example.

So how would this example work ? Well just you would expect it. If there is no such IP address in the table it will be added with hits=1 if it is already where (note ip is PRIMARY KEY) it would be just incremented and last visit timestamp updated.

The benefit of using this feature insted of INSERT + UPDATE could be different, depending on number of new rows and data set size. 30% speedup should be typical. Performance increase is not the only benefit – what is even more important the application code becomes simplier – less error prone and easy to read.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. pimsainnum says:

    If I had tableA field(Id, timestamp, day, value)

    Insert or Update where field (day) not same.

    How to do??

  2. sunxiunan says:

    MySQL is a complete DataBase Server now. But so many people only think it as simple as before.
    Such as it didn’t have transaction process, or don’t support many features.
    I think it is important that study the release notes about it to know how we can use it effectively.

  3. peter says:

    Yes. There are many features being added which might be hard to track. I also would not always encourage all of them just for sake of using performance. Many of new features are complex – for example Stored Procedures. If you start using stored procedures in your code you might need to ensure other members of the team are also fluent with them. INSERT ON DUPLICATE KEY UPDATE is of the different sort – there is not much to learn about it – you can use it safely in your code. If someone does not know about it it takes to read single page to learn.

  4. Paul Wilson says:

    I’ve done a lot of work with traffic databases, and others with IP addresses and one thing appeared very quickly – if possible store the IP address both as an unsigned int as as a char(15). The conversion functions aren’t fast iterated over millions of rows.

  5. peter says:

    Paul,

    I’m actually surprised with your finding. Unless you’re doing indexing and going to do something like where IP like “192.168.%” there should not be serious benefit to store both versions. Yes you save on function call but typically it is not serious response time contributor. You however waste space and especially if you do inexing on this field you end up with larger and slower indexes.

    Of course it depends a lot on workload. For example if you do full web request logging and store page, referer, user agent, cookie information you might not save a lot of space by using integer ip

  6. Apachez says:

    But in the case of “IP like ‘192.168.%'” you could use “IP BETWEEN 3232235520 AND 3232301055″ if you have stored ips as unsigned int which also should be faster I guess to process I guess…

  7. peter says:

    Apachez,

    Yes you’re right of course. I used wrong example to ilustrate an idea. like “%192.168.%.255″ would be better example. And yes you still can use inet_ntoa and apply extra filtering on result. It is just matter of convenience :)

    Especially in log tables where you might want to run some adhoc queries manually ease of use might be more important than couple of percent performance gain.

    Thanks for correction again.

  8. Erick says:

    ON DUPLICATE KEY is the same as an UPDATE command. It is merely a simpler (and confusing) way of doing things. In a typical MySQL way, when this does get invoked, it returns “2 Affected Rows”. Nice. How it is a really effective feature is lost on me —

    OLD WAY:
    1. Insert
    2. If error, then UPDATE.

    NEW WAY:
    1. Insert
    2. Engine discovers if it already exists (through internal error), so UPDATE

    No different.

    As for inet ip address conversion functions, it is still much more recommended to use the functionality provided by your programming language. In the case of PHP for instance, ip2long function will do this for you: http://php.net/ip2long – you can do a simple loop microtime test to see some STARK and REAL difference in performance when you get to medium sized data.

  9. peter says:

    Erick,

    The fact it returns 2 rows affected if only one really was updated is probably a bug. Please report it.

    I can’t say much about being confused by this feature. I think it is rather straightforward – insert data and if it is where update it.

    Now from performance standpoint the difference is in number of statements – for in memory data sets going over network and parsing query is what is taking most of time, this is why having it in single statement helps to improve performance.

  10. CJ says:

    SICK! This should save me some sql calls.

  11. Marki says:

    Peter: It is not a bug, it is a feature (documented one).
    When row is inserted, report it as 1 affected row. When it is updated, report it as 2 affected rows.

  12. peter says:

    Thanks Maki,

    Even though I find it counter intuitive. May be similary to update statement using rows_matched to see the rows which were already in the table would be better idea.

  13. A short note on performance: If you use this feature in order to aggregate reports or something like that (i.e. you have a low insert but high update rate) a strategie like:

    try {
    UPDATE
    } catch (exception) {
    INSERT INTO
    }

    might be better. I notices a speed up of up to 15% using this strategie instead of ON DUPLICATE KEY UPDATE. The higher your actual insert rate becomes the lower is the benefit until it performs worse. As always: Test your application ;)

  14. The problem with inet_aton and inet_ntoa is, that IPv6 is just not supported. So you will end up implementing inet_aton6 by hand (which is tricky) or using varchars again.

  15. Eraser says:

    Hey,
    thanks for this post! I searched for a function like this!

    I love MySQL! Very powerfull!

    Regards,
    E

  16. foo says:

    also useful in combination with unique keys :)
    see http://textsnippets.com/posts/show/975

  17. TM says:

    Is there any way i can use a query similar to this but have some kind of where clause at the end? say if i wanted to do a whole bunch of inserts but update them if they exist already and update the primary key also? so for example
    insert into tmmember values (‘STRES17N’,’SK’,’1725′,’ESTRES17′,’N’,’6′,’255′,’-1′)
    on duplicate key update memb_memberid = ‘STRES17N’, memb_teamid = ‘SK’ only the original memb_member id might be ‘SKRES17N’ so as you can see id like to be able to say where memb_memberid = ‘SKRES17N’ for the cases when its true.

  18. I’m trying to use this on INSERT SELECT statement and im getting an error. my sql is something like this

    INSERT INTO table SELECT * FROM table WHERE id=’2′ ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID() + 1
    I use LAST_INSERT_ID() coz im using auto increment.

  19. peter says:

    Dreamluverz,

    The best place to ask the question is forums.mysqlperformanceblog.com ; and it is also best to provide exact statement and error message :)

  20. Sergio says:

    I’ve tried this but I didn’t notice any performance improvement.

    The UPDATE rate is higher than the INSERT rate.

    So.. peter, from your experience, what’s the FASTEST WAY from the following:

    1) This method I’ve been using until found out about this article. First, we check if there is a record.. then.. see there.

    mysql_query(“SELECT COUNT(*) as count FROM table WHERE user_id = ‘1’”);
    $cnt = mysql_fetch_array();
    $cnt = $cnt[‘count’];

    if ($cnt == 0) {
    mysql_query(“INSERT INTO table VALUES (‘1′, ‘1’)”);
    }
    else {
    mysql_query(“UPDATE table SET user_hits = user_hits+1 WHERE user_id = ‘1’”);
    }

    2) With inspiration from Peter Romianowski’s post. Haven’t tested, but I guess I’ve wrote the code ok.

    mysql_query(“UPDATE table SET user_hits = user_hits+1 WHERE user_id = ‘1’”);
    $affected_rows = mysql_affected_rows();

    if ($affected_rows == 0) mysql_query(“INSERT INTO table VALUES (‘1′, ‘1’)”);

    3) Using INSERT .. ON duplicate KEY UPDATE ..

    mysql_query(“INSERT INTO table VALUES (‘1′, ‘1’) ON duplicate KEY UPDATE user_hits = user_hits+1 “);

    So.. which one is recommended? I repeat.. my update rate is higher than the insert rate. I guess method 2. But I’m waiting other opinions.
    Thanks.

  21. Sergio, I would recommend version 2 for high update rates. But I would also recommend to benchmark your application in high concurrency environment. ;)

    Cheers

    Peter

  22. Sergio says:

    Thanks for reply Peter.

    However, version 2 has a small issue.
    If you try to update a record with the same data as in table, mysql ignore the query (affected rows => 0), and insert query is executed, even if the record exist in table.

    For example, if I refresh twice the page the following query will be the same (if I refresh the page in the same SECOND), so affected rows = 0, thus INSERT INTO query is executed and an error will occur (duplicate entry cos’ of the primary index)

    INSERT INTO forum_sessions (session_id, session_user_id, session_start, session_time, session_ip, session_page, session_logged_in, session_browser, session_mode) VALUES (‘a321587721fdb3a86a49d761fb4b1a2e’, ”, ‘1216230846’, ‘1216230846’, ‘127.0.0.2’, ‘main’, ‘0’, ‘Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0′, ‘html’)

    Of course errors can be masked for users..

    Can you provide me some tips for benchmarking? The generation time of the pages is almost the same in all 3 situations (I’ve tried all versions on my public site too.. ). localhost, I just refresh the page multiple times & read the smallest generation time.

    Maybe I don’t feel any difference because the tables are small (4,870 rows, 1.4MB & 20-60 rows, < 100KB)

  23. peter says:

    Sergio,

    This is why it is always mentioned in cases count=count+1 and similar – in this case you always change the data. Really MySQL returns both rows matched and rows updated and it is rows matched you need to wants.

    Generally Insert on dup key update is the method which was designed to make it most optimal though the question if it always is remains open :)

    Also note the difference in this cases comes from amount of queries you have to run mainly – run benchmarks over network to see the difference better.

  24. pravin says:

    I have a query in which I’m checking whether LAN(my unique key) is duplicate or not,if it duplicate then update the record else insert that LAN No.

    I did it in the following way.

    insert into romappingfinal(LAN,romappingfinalid,branch_code,
    RO_CODE,main_ro_code,loan_end_date,ro_status,main_ro_status,
    ro_transfer_status,main_ro_transfer_status,ro_name,main_ro_name)
    select LAN,romappingstagingid as romappingfinalid,branch_code,
    RO_CODE,main_ro_code,loan_end_date,ro_status,main_ro_status,
    ro_transfer_status,main_ro_transfer_status,ro_name,main_ro_name
    from viewromappingfinal
    on duplicate key update LAN=values(LAN)

    it’s working fine if i won’t use romappingfinalid(table primary key).

    Suggest me how to use the INSERT ON UPDATE KEY using a table primary key but checking the condition on another primary key(LAN).

  25. zhilly says:

    i sinc i add duplicate key the insert not work
    that not work now
    someone know my little erreur?

    //create and execute the query
    $sql = “INSERT INTO yah_unlimited (data_key,data_value,agent_id,data_group,access_time) VALUES (‘$key’, ‘$value’, ‘$group’, ‘$owner_id’, NOW()) ON DUPLICATE KEY UPDATE access_time = NOW()”;
    $result = mysql_query($sql) or die(mysql_error());
    echo ‘Store successful.';

  26. Jesse says:

    Hello,

    I have a dilema with INSERT (fld1, fld2, fld3) VALUES(‘1′,’1a’,’1b’),(‘2′,’2a’,’2b’),(‘3′,’3a’,’3b’) ON DUPLICATE KEY UPDATE fld2 = ??? ( I want this to be the value that would be inserted ie ‘2a’) (fld1 is a unique index)

    What should go in the ??? spot?

    Thanks.

  27. VALUES(fld2) should go there.

  28. Artis says:

    If I am collecting banner display statistics for each day, then I have table ad_stat with fields:
    id, banner_id, date, views
    with (banner_id, date) as unique

    I use insert into ad_stat (banner_id, date, views) values ($banner_id, ‘$date’, 1) ON DUPLICATE KEY UPDATE views=views+1

    So this query firstly tries to insert and only then to Update – is there opposite way like:
    UPDATE … ON NO ROWS INSERT ???
    Because there are 5 inserts daily and 50 000 updates – wouldn’t it somehow increase performance?

  29. kn33ch41 says:

    A note for those of you who may have problems updating a row that does in fact exist in the table, and so are always seeing a new row inserted:

    In order for this to work, the PRIMARY KEY defined for the table should only contain one column–or if it has multiple columns–the data you’re going to insert/update for the given columns should be identical to the columns defined in the PRIMARY KEY.

    To illustrate:

    PRIMARY_KEY(id, timestamp) == a single key (but with two values, thus more unique)

    The id is very likely always going to be unique. However, adding the timestamp to the primary key makes it even more unique; so, if the id exists in the table and you simply want to update the timestamp, the new value for the timestamp you’re providing will mean that id matches but timestamp does not, because you’ve defined the primary key to be id and timestamp, which means BOTH values you’re inserting/updating must equal the values currently in the database to make the primary key match and therefore update the row instead of insert a new row with the same id, but different timestamp.

  30. I HAVE MISSED THIS FOR TWO YEARS. OH WHAT A S… . … WHAT EVER.
    and now… just now found it while searching thinking why not there be an option like this… and here i am…

    how to do like this

    insert into (field1, field2, field3) values (1,2,3),(1,2,3) and so including on duplicate update. because i want insert/update many records from am json source.
    This should have been posted in a forum but since it deals my problem exactly i thought of posting it here.

  31. Brian May says:

    well, i definetly using INSERT ON DUPLICATE KEY UPDATE rather than REPLACE INTO
    it’s for performance and code reading

  32. Dudeson says:

    Beautiful function. I was on the verge of using a conditional update statement until i discovered this. Anyone knows if there is a mysql function that will shift all records key id down? by adding 1 when I add a new key, say 0 on the TOP of the table?

  33. Scott says:

    With ON DUPLICATE KEY UPDATE is there some sort of temp table involved with the old and new values?
    I need access to the old values on update, but only want to hit the database once.
    It somewhat works with CASE WHEN statements to catch high and low values by day.

    field1= CASE WHEN field1>’$field1′ THEN ‘$field1′ ELSE field1 // this gets a low value
    field2= CASE WHEN field2’$field1′ THEN ‘$field3′ ELSE field3 // this does not work off the existing (old) field1 value!

    This should update field3 when field1 changes. But it does not.

    field3= CASE WHEN field1=’$field1′ THEN ‘$field3′ ELSE field3 // if field3 is time you will only see the most recent value match of field1

    This does work but it looks to see if field1 is equal to the new value.
    The problem here is that I can only have the last match, I would like to have the first match in some cases.

    It seems that old and new values are only available within a given field. Is that so?

    Is there another way to have old/existing values available from one field in the case switch or value of another field.

    Thanks.

  34. Scott says:

    The field2 line above is a mess… it should be

    field2= CASE WHEN field2’$field1′ THEN ‘$field3′ ELSE field3 // this does not work off the existing (old) field1 value!

  35. Steve Jackson says:

    Given the table.

    CREATE TABLE test (
    id int(11) NOT NULL,
    md5 char(32) DEFAULT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY md5 (md5)
    ) ENGINE=InnoDB

    INSERT IGNORE INTO test (md5) values (‘A’)

    1 row affected

    INSERT IGNORE INTO test (md5) values (‘A’)

    Duplicate entry ‘A’ for key ‘md5′

    I think an incredibly useful feature here would be to return the innodb row id of the row which generated the duplicate key error. This would save us a select statement (either before or after the insert) to determine which row caused the dupe error. (I guess the unique row_id would not be of much help, so only return the value of PRIMARY if it is defined)

    Is there something existing like this already? I dumped the mysqli connection object in php, and I see nothing like that. Insert_id of course, but that can only be used when there is an auto_increment column.

    Someone will answer (why not just use on duplicate key update?) Well this is because i want “on duplicate key return unique_row_id”

    Cheers

    //Steve

  36. Gleb Deykalo says:

    What can you say about http://bugs.mysql.com/bug.php?id=52020

    Can you please explain how really locking in INSERT … ON DUPLICATE KEY UPDATE works.

    As I recall:

    Lock 1) Transaction 1 locks row for INSERT
    Lock 2) Transaction 2 tries to locks row for UPDATE (goes into queue)
    Lock 3) Transaction 1 tries to locks row for UPDATE (goes into queue)

    It is reasonable not to release Lock 1, but why MySQL tries to create Lock 3? Why can’t it use Lock 1 instead of Lock 3 for UPDATE operation?

    Or maybe I am wrong at all?

  37. Andrea says:

    Hi guys, I’ve have a doubt: if I’m right the UPDATE statement implied a SELECT statement before the datas modification. So, is the primary key lookup in INSERT INTO…ON DUPLICATE KEY UPDATE faster than the SELECT statement implied in the UPDATE?

    e.g. UPDATE table SET value=newValue WHERE primary=primaryValue;

    INSERT INTO table VALUES primaryValue ON DUPLICATE KEY UPDATE value=newValue

  38. Rob says:

    Just a note on the returned rows affected count:
    above somebody mentioned if it does the update it returns “2 rows affected,” if only does the insert it returns 1.

    Methinks intentional (coz still like that in v5.5) and a damn good idea: and here is what I think way.

    Imagine it returns 1 row affected in either case:
    how long would it be before someone asks: ‘how do I know if it inserted or updated?”
    – hence returning 1 or 2 they have the answer
    — (either way it’s > 0 so if using that as a test it’s fine), and for those using a == 1 test: back to programming school

    (would you prefer they returned “0,1 string” or 0 then 1, or introduced a new second value like @@ROWCOUNT?
    – imagine how much code that would break.)

    “2 row(s) affected” not a bug, it’s a damn good idea and I hope that never changes.

  39. Vladimir says:

    peter, what do you say about the performance of this method?


    CREATE TABLE ipstat(
    ip INT UNSIGNED NOT NULL PRIMARY KEY,
    hits INT UNSIGNED NOT NULL UNIQUE AUTO_INCREMENT,
    last_hit TIMESTAMP
    );

    REPLACE ipstat (ip, last_hit) VALUES (INET_ATON('192.168.0.1'),NOW())

  40. Vladimir says:

    oh, sorry. It work for my single counter

Speak Your Mind

*