What is the fastest way to detect data change in a table


  • What is the fastest way to detect data change in a table

    Hello,

    First post here, hope it's not the last.

    I'm currently using Percona Server 5.5.18-rel23.0-log in production. I have 2 servers in multi-master replication, with application-level load balancing and failover. Each server is doing an average of 2000 QPS with peaks of > 30K QPS.
    These are the top command counters from innotop:


    Com_select     178263485  71.80%
    Com_insert      21950184   8.84%
    Com_update      17436557   7.02%
    Com_begin       12200095   4.91%
    Com_change_db    6586734   2.65%
    Com_set_option   6449891   2.60%
    Com_delete       2167565   0.87%
    Com_replace      1736384   0.70%


    The application is a monitoring system, and the reason for the high number of SELECTs is that every client polls the database for changes every second. There are about 100 tables that may change at any time, and the data from them needs to be sent to the clients ASAP.
    Prior to using Percona Server, I was running standard MySQL with MyISAM tables, so the application was built with that in mind.

    The method I was using was MyISAM's live checksums (CHECKSUM=1 in the DDL): every second each client issued "CHECKSUM TABLE xxxx", stored the result in a session variable, and compared it the next second with the new value. This was VERY fast, because MyISAM stores the checksums and returns them without needing to recalculate.
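    For reference, the live-checksum setup described above looks roughly like this (the table and column names here are placeholders, not from the real schema):

    ```sql
    -- Hypothetical table; CHECKSUM=1 tells MyISAM to maintain a live checksum
    CREATE TABLE alerts (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      status VARCHAR(32) NOT NULL
    ) ENGINE=MyISAM CHECKSUM=1;

    -- Returns instantly for MyISAM tables created with CHECKSUM=1,
    -- since the checksum is stored rather than recalculated
    CHECKSUM TABLE alerts;
    ```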

    When I first tried to migrate to InnoDB, I was faced with the problem that InnoDB doesn't have live checksums, and this led to performance unacceptable for production, as each client (there are about 70 clients online in busy hours and 30-40 in non-working hours) performs CHECKSUM TABLE on 10-20 tables every second. After some research I found that Percona Server has a table in information_schema called innodb_table_stats, whose "modified" column increments when the table is updated. So, long story short, now I use Percona Server, and instead of CHECKSUM TABLE I do a "SELECT rows, modified FROM information_schema.innodb_table_stats" and make a checksum of these two values. Here is the function I've come up with (it even has a backwards-compatibility part):


    function checksum_table($input_tables){
        if(!$input_tables) return false; // Sanity check
        $tables = is_array($input_tables) ? $input_tables : array($input_tables); // Make $tables always an array
        $where = "";
        $checksum = "";
        $found_tables = array();
        $tables_indexed = array();
        foreach($tables as $table_name){
            $tables_indexed[$table_name] = true; // Indexed array for faster searching
            if(strstr($table_name,".")){ // If we are passing db.table_name
                $table_name_split = explode(".",$table_name);
                $where .= "(table_schema='".$table_name_split[0]."' AND table_name='".$table_name_split[1]."') OR ";
            }else{
                $where .= "(table_schema=DATABASE() AND table_name='".$table_name."') OR ";
            }
        }
        if($where != ""){ // Sanity check
            $where = substr($where,0,-4); // Remove the trailing " OR "
            $get_chksum = mysql_query("SELECT table_schema, table_name, rows, modified FROM information_schema.innodb_table_stats WHERE ".$where);
            while($row = mysql_fetch_assoc($get_chksum)){
                if(isset($tables_indexed[$row['table_name']])){ // Not entirely foolproof, but saves some queries like "SELECT DATABASE()" to find out the current database
                    $found_tables[$row['table_name']] = true;
                }elseif(isset($tables_indexed[$row['table_schema'].".".$row['table_name']])){
                    $found_tables[$row['table_schema'].".".$row['table_name']] = true;
                }
                $checksum .= "_".$row['rows']."_".$row['modified']."_";
            }
        }
        foreach($tables as $table_name){
            if(empty($found_tables[$table_name])){ // Table not in information_schema.innodb_table_stats (probably not an InnoDB table, or not Percona Server)
                $get_chksum = mysql_query("CHECKSUM TABLE ".$table_name); // Checksum the old-fashioned way
                $chksum = mysql_fetch_assoc($get_chksum);
                $checksum .= "_".$chksum['Checksum']."_";
            }
        }
        return sprintf("%s",crc32($checksum)); // crc32() is faster than md5(); returned as a string to avoid PHP's signed-integer problems
    }



    So it seemed everything was OK, but it turned out it's not. The problem is that the "modified" column is incremented only on updates to indexed columns. For example, if I have a table with 3 columns (id (PK), test1 (INDEX), test2) and do UPDATE test1='newvalue' WHERE id=x, the "modified" column for this table in innodb_table_stats increments, but if I do UPDATE test2='newvalue' WHERE id=x it doesn't.
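    A minimal way to reproduce this, assuming Percona Server with the innodb_table_stats table available:

    ```sql
    CREATE TABLE t (
      id INT PRIMARY KEY,
      test1 VARCHAR(32),
      test2 VARCHAR(32),
      INDEX (test1)
    ) ENGINE=InnoDB;

    UPDATE t SET test1='newvalue' WHERE id=1; -- indexed column: "modified" increments
    UPDATE t SET test2='newvalue' WHERE id=1; -- non-indexed column: "modified" does NOT increment

    -- Check the counter between the two updates:
    SELECT rows, modified FROM information_schema.innodb_table_stats
    WHERE table_schema=DATABASE() AND table_name='t';
    ```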

    Also, doing 1000-1500 SELECTs per second from information_schema seems rather hacky to me, but it works...

    And finally the question: can you recommend a better way to detect a change in any table, without managing hundreds of triggers or adding "last_update" columns? The database servers are remote, so any file-checking method is out of the question too.
    So any suggestions?

    Thanks!

    P.S. I've asked this on Stack Overflow, but I didn't receive any good help there. Anyway, here's the link - http://stackoverflow.com/questions/8643018/mysql-fastest-way-to-check-if-data-in-innodb-table-has-changed/8643198#8643198

  • #2
    The only thing I can think of is that you create a "tableVersionIncrement" table something like:

    CREATE TABLE tableVersionIncrement (
        tableName CHAR(10) PRIMARY KEY,
        version INT NOT NULL
    ) ENGINE=INNODB

    (Even better from a performance standpoint would be to use ints for the tableName, but that would become more cumbersome to work with, and I don't think it would change performance much since the table would be very small anyway.)

    Then you fill it with one row for each table and when a table is updated you increment the corresponding row with:

    UPDATE tableVersionIncrement SET version = version+1 WHERE tableName = 'yourTable'


    That way all your selects can be performed against this table which will be really small and all access will be on primary key to speed things up.
    And by leaving it as InnoDB, the updates don't block each other the way they would with a MEMORY table, and the state is kept across a restart.

    Now the question is whether you have any control over how the clients write/update data in the tables, so you can place the update of this table there.
    Or you are back to adding a trigger on each table to handle the increment in case of an insert/update/delete.
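    For the record, such a trigger could look roughly like the sketch below (names taken from the example above; you would need one trigger per table and per write operation, i.e. also AFTER INSERT and AFTER DELETE):

    ```sql
    DELIMITER //
    CREATE TRIGGER yourTable_version_au AFTER UPDATE ON yourTable
    FOR EACH ROW
    BEGIN
      -- Bump the per-table version counter on every modified row
      UPDATE tableVersionIncrement SET version = version + 1
      WHERE tableName = 'yourTable';
    END//
    DELIMITER ;
    ```

    Note that a FOR EACH ROW trigger fires once per affected row, so a multi-row UPDATE increments the counter several times - harmless for change detection, but it adds overhead to large statements.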

    Which might not be what you wanted to hear

    Hope you find a good solution and let us know what it is.



    • #3
      Thanks for your reply sterin.

      Unfortunately the application is very late in development and there is no single wrapper around database calls, so it would be very hard to find all the queries that modify table data and add this additional statement. Otherwise the idea is good, and it would be fast, but as I said it would be a great amount of effort for now.

      I am currently researching another solution, but I must do some extensive performance testing first. The solution is to have one persistent daemon, which will periodically do "CHECKSUM TABLE xxx" on all tables marked with "CHECKSUM=1" and will cache the checksums in a reference table like the one you proposed. I will report my progress, but any other ideas will be greatly appreciated.

      UPDATE:

      This method proved to be pretty heavy on the database. Here are some graphs from the tests (they are from an actual production server):

      The spike in the load (it's actually a gap on the MySQL graphs) at about 12:10 is because I accidentally spawned 100+ threads before I managed to kill the daemon :roll:.

      On the first run I forgot to add sleep() at the end of the test loop and went for lunch :roll: You can see the impact on the database. The second run was with a 3-second sleep between the test loops. The daemon works with 2 threads and does an average (measured over 10 seconds) of 17 table checksums per second with sleep(3), and 30 without sleep, finishing a loop in about 2 seconds. There are 70 tables to be checksummed, with an average of 50,000 rows (max 670,000).

      There's an option to use one slave for this, but that would make the slaves critical for the application, and that's not what I want right now...

      So I'm still looking for other options.



      • #4
        Sterin's suggestion can be implemented with triggers as well.



        • #5
          I can suggest a few approaches, but probably the best way is to use a trigger that has Roland Bouman's hack to only fire on the first row of a multi-row update (for efficiency). I wrote about this in High Performance MySQL. Let me know if you don't have that book and I can dig out the code, but I don't recall it offhand.

          What should the trigger do? Something as cheap as possible. It could call a memcached function or another UDF to inexpensively update something external to the database (perhaps just send a UDP packet). If it's internal to the DB it's likely to cause a lot of overhead. Alternatively it could update a row in a MEMORY table, but that might need testing to validate the performance overhead.



          • #6
            Thank you Baron for the answer!

            I think I will try the application-level approach - refactoring the code to abstract mysql_query() calls into a wrapper function such as this:


            function _mysql_query($query,$link_identifier=null){
                $test_query = str_replace(array("`"," . ",". "," ."),array("",".",".","."),trim($query));
                preg_match("/^UPDATE[\s]+([[:word:].]+)[\s]+/i",$test_query,$matches);
                if(!empty($matches[1])){
                    if($GLOBALS['last_db_name'] == "local"){ // We are working with our own database
                        mysql_query("REPLACE INTO table_checksums (table_name,checksum) VALUES ('".$matches[1]."',".mt_rand().")");
                    }
                }
                return mysql_query($query);
            }


            I have evaluated this approach and I think it will be better and less time-consuming than adding/testing/maintaining triggers. The other problem with triggers is that some of these tables already have triggers defined, so maintaining them automatically would be a pain.
            I'm aware that mt_rand() is a PRNG but it's good enough for this situation.

            This is the table:


            CREATE TABLE `table_checksums` (
                `table_name` varchar(64) NOT NULL,
                `checksum` int(11) unsigned DEFAULT NULL,
                PRIMARY KEY (`table_name`)
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8;


            The load will be around 1/10 (UPDATE/SELECT). Will the MEMORY engine with a HASH index speed this up?
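            A MEMORY variant of the same table might look like this (a sketch; MEMORY uses HASH indexes by default, which suits the point lookups here, but its contents are lost on restart):

            ```sql
            CREATE TABLE `table_checksums` (
                `table_name` varchar(64) NOT NULL,
                `checksum` int(11) unsigned NOT NULL DEFAULT 0,
                PRIMARY KEY (`table_name`) USING HASH
            ) ENGINE=MEMORY DEFAULT CHARSET=utf8;
            ```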

            I've tried rename_function() (which comes with APD) to rename the built-in mysql_query() function and substitute my own, but it's pretty unstable in production - lots of random segfaults.

            The suggestion for Memcached is really good, but I'm still trying to avoid such dependencies for core operations of this system. I'm using Memcached and APC, but they can be dropped at any time with only performance penalties for the application, not loss of functionality.

            I will post updates later.



            • #7
              So here's the promised UPDATE:

              Here is the final wrapper function for the queries:


              function mysql_query_checksum($query,$link_identifier=null){
                  $test_query = str_replace(array("`"," . ",". "," ."),array("",".",".","."),trim($query));
                  preg_match("/^UPDATE[\s]+([[:word:].]+)|^INSERT[\s]+INTO[\s]+([[:word:].]+)[\s]+|^REPLACE[\s]+INTO[\s]+([[:word:].]+)[\s]+|^DELETE[\s]+FROM[\s]+([[:word:].]+)|TRUNCATE[\s]+TABLE[\s]+([[:word:].]+)/i",$test_query,$matches);
                  $table_name = false;
                  for($i=1;$i<count($matches);$i++){ // Find whichever capture group matched
                      if($matches[$i]){
                          $table_name = $matches[$i];
                          break;
                      }
                  }
                  if($table_name !== false && $GLOBALS['last_db_name'] == "local"){
                      mysql_query("INSERT INTO table_checksums (table_name,checksum) VALUES ('".$table_name."',1) ON DUPLICATE KEY UPDATE checksum=checksum+1");
                  }
                  return mysql_query($query);
              }


              Incrementing the checksum value instead of inserting a random one gives me the ability to check which tables are updated most often and do some optimizations on the modification queries.
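              For example, a query along these lines (a sketch, not part of the application) would list the most frequently modified tables:

              ```sql
              -- With the increment-on-write scheme, "checksum" doubles as an update counter
              SELECT table_name, checksum AS update_count
              FROM table_checksums
              ORDER BY checksum DESC
              LIMIT 10;
              ```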

              And here is the checking function:


              function checksum_table($tables){
                  if(!$tables) return false;
                  if(!is_array($tables)){
                      $where = "table_name='".$tables."'";
                  }else{
                      $where = "table_name IN ('".implode("','",$tables)."')";
                  }
                  $checksum = "";
                  $get_chksum = mysql_query("SELECT table_name, checksum FROM table_checksums WHERE ".$where." ORDER BY table_name ASC");
                  while($row = mysql_fetch_assoc($get_chksum)){
                      $checksum .= "_".$row['checksum']."_";
                  }
                  return md5($checksum);
              }


              I have removed the table_schema column, because I'm only interested in one database for these tables anyway.

              Now to the numbers. I've logged the calls to both functions and tried both the InnoDB and MEMORY engines for the table_checksums table. These are only the times in microseconds that the mysql_query() calls inside the functions took, not the whole functions. The tested time window is about 10 minutes per engine.

              innodb insert times: AVG:0.000436 MAX: 0.058303
              memory insert times: AVG:0.000402 MAX: 0.043116
              innodb select times: AVG:0.000337 MAX: 0.044086
              memory select times: AVG:0.000321 MAX: 0.096144

              updates to table_checksums per second: AVG:257.1 MAX: 2124
              selects from table_checksums per second: AVG:938.4 MAX: 1387

              The numbers are pretty close, and there wasn't any noticeable difference on the clients, but the graphs from the database showed an increased number of InnoDB row lock waits when using InnoDB. It was 0 before I put the new functions in production. So for now I'm staying with the MEMORY engine.

              You can see on the following graphs that this many selects from information_schema are not as light as they seem. Some graphs have a different time interval, because there were some spikes caused by the testing which stretch them badly...

              I think this is a good optimization for today.



              • #8
                I've also made some improvements to the functions. Now if we are in a transaction and call mysql_query_checksum(), it only caches the modified table in a global array and updates the checksum once, on commit - but you must start and commit all transactions with db_start_trx() and db_commit_trx():


                function mysql_query_checksum($query,$link_identifier=null){
                    //if($GLOBALS['last_db_name'] == "local"){ // Remove this if you are not me
                    $test_query = str_replace(array("`"," . ",". "," ."),array("",".",".","."),trim($query));
                    preg_match("/^UPDATE[\s]+([[:word:].]+)|^INSERT[\s]+INTO[\s]+([[:word:].]+)[\s]+|^REPLACE[\s]+INTO[\s]+([[:word:].]+)[\s]+|^DELETE[\s]+FROM[\s]+([[:word:].]+)|TRUNCATE[\s]+TABLE[\s]+([[:word:].]+)/i",$test_query,$matches);
                    $table_name = false;
                    for($i=1;$i<count($matches);$i++){
                        if($matches[$i]){
                            $table_name = $matches[$i];
                            break;
                        }
                    }
                    if($table_name !== false){
                        if($GLOBALS['db_in_trx'] === true){
                            $GLOBALS['modified_table_checksums'][$table_name] = 1; // Defer the checksum update until commit
                        }else{
                            db_update_checksum($table_name);
                        }
                    }
                    //}
                    return mysql_query($query);
                }

                function db_update_checksum($table_name){
                    mysql_query("INSERT INTO table_checksums (table_name,checksum) VALUES ('".$table_name."',1) ON DUPLICATE KEY UPDATE checksum=checksum+1");
                }

                function db_begin_trx(){ // Alias function
                    db_start_trx();
                }

                function db_start_trx(){
                    mysql_query("START TRANSACTION");
                    $GLOBALS['db_in_trx'] = true;
                }

                function db_commit_trx(){
                    mysql_query("COMMIT");
                    if(!empty($GLOBALS['modified_table_checksums'])){
                        foreach($GLOBALS['modified_table_checksums'] as $table_name => $flag){
                            db_update_checksum($table_name);
                        }
                        $GLOBALS['modified_table_checksums'] = array(); // Reset so the next commit doesn't re-update stale entries
                    }
                    $GLOBALS['db_in_trx'] = false;
                }

                function db_rollback_trx(){
                    mysql_query("ROLLBACK");
                    $GLOBALS['modified_table_checksums'] = array(); // Discard deferred updates from the rolled-back transaction
                    $GLOBALS['db_in_trx'] = false;
                }

                function db_restart_trx(){
                    db_commit_trx();
                    db_start_trx();
                }


                Then you do something like this:

                db_start_trx();
                // ... update some tables N times
                db_commit_trx(); // <-- checksums are updated only once per table at this point, saving N (minus the number of tables) round trips to the database


                Hope this helps someone in a need like me.

                Just to summarize...
                If someone is asking the same question as in the topic title - do it in the application! You can do it with live checksums (like I did a long time ago) if you are using MyISAM, but doing it in the application will save you time and head-banging refactoring the code when you want to move to InnoDB.



                • #9
                  Thanks for the update, interesting reading!

                  I would probably have gone with InnoDB: even though you got more row locks, you'd get a continuously incrementing counter, which simplifies the checking and would give you some history.

                  And I would have gone with:
                  `checksum` int(11) unsigned NOT NULL DEFAULT 0,
                  To avoid the need for null values.
                  But those are semantics.

                  It's nice to hear that you found a good solution!
