Buy Percona ServicesBuy Now!
Subscribe to Latest MySQL Performance Blog posts feed
Updated: 1 hour 27 min ago

How Binary Logs (and Filesystems) Affect MySQL Performance

May 4, 2018 - 3:50pm

I want to take a closer look at MySQL performance with binary logs enabled on different filesystems, especially as MySQL 8.0 comes with binary logs enabled by default.

As part of my benchmarks of the MyRocks storage engine, I’ve noticed an unusual variance in throughput for the InnoDB storage engine, even though we spent a lot of time making it as stable as possible in Percona Server for MySQL. In the end, the culprit was enabled binary logs. There is also always the question, “If there is a problem with EXT4, does XFS perform differently?” To answer that, I will repeat the same benchmark on the EXT4 and XFS filesystems.

You can find our previous experiments with binary logs here:

Benchmark Setup

A short overview of the benchmark setup:

  • Percona Server for MySQL 5.7.21
  • InnoDB storage engine
  • In contrast to the previous benchmark, I enabled foreign keys, used REPEATABLE-READ isolation level, and I used UTF8 character sets. Because of these changes, the results are not really comparable with the previous results.
  • The dataset is the same: sysbench-tpcc with ten tables and 100 warehouses, resulting in a total of 1000 warehouses, and about a 90GB dataset size.
  • I will use innodb_buffer_pool_size 80GB, 70GB, and 60GB to emulate different IO loads and evaluate how that affects binary logs writes.
Initial Results

For the first run, let’s check the results without binary logs vs. with binary log enabled, but with sync_binlog=0:

We can see that results without binary logs are generally better, but we can also see that with binary logs enabled and sync_binglog=0, there are regular drops to 0 for 1-2 seconds. This basically results in stalls in any connected application.

So, enabling binary logs may result in regular application stalls. The reason for this is that there is a limit on the size of the binary log file (max_binlog_size), which is 1GB. When the limit is reached, MySQL has to perform a binary log rotation. With sync_binlog=0, all previous writes to the binary log are cached in the OS cache, and during rotation, MySQL forces synchronous flushing of all changes to disk. This results in complete stalls every ~40 seconds (the amount of time it takes to fill 1GB of binary log in the above tests).

How can we deal with this? The obvious solution is to enable more frequent sync writes of binary logs. This can be achieved by setting sync_binlog > 0. The popular choice is the most strict, sync_binlog=1, providing the most guarantees. The strict setting also comes with noted performance penalties. I will also test sync_binlog=1000 and sync_binlog=10000, which means perform synchronous writes of binary logs every 1000 and 10000 transactions, respectively.

The Results

The same results in a tabular format with median throughput (tps, more is better)

Bp sync_binlog 0 1 1000 10000 nobinlog 60 GB 4174.945 3598.12 3950.19 4205.165 4277.955 70 GB 5053.11 4541.985 4714 4997.875 5328.96 80 GB 5701.985 5263.375 5303.145 5664.155 6087.925


Some conclusions we can make:

  • sync_binlog=1 comes with the biggest performance penalty, but with minimal variance. This is comparable to running without binary logs.
  • sync_binlog=0 provides best (for enabled binary logs) performance, but the variance is huge.
  • sync_binlog=1000 is a good compromise, providing better performance than sync_binlog=1 with minimal variance.
  • sync_binlog=10000 might not be good, showing less variance than with 0, but it is still big.

So what value should we use? This is probably a choice between sync_binlog=1 or some value like 1000. It depends on your use case and your storage solution. In the case of slow storage, sync_binlog=1 may show a bigger penalty compared to what I can see on my enterprise SATA SSD SAMSUNG SM863.


All of the above results were on an EXT4 filesystem. Let’s compare to XFS. Will it show different throughput and variance?

The median throughput in tabular format:

sync_binlog Buffer pool (GB) EXT4 XFS 0 60 4174.945 3902.055 0 70 5053.11 4884.075 0 80 5701.985 5596.025 1 60 3598.12 3526.545 1 70 4541.985 4538.455 1 80 5263.375 5255.38 1000 60 3950.19 3620.05 1000 70 4714 4526.49 1000 80 5303.145 5150.11 10000 60 4205.165 3874.03 10000 70 4997.875 4845.85 10000 80 5664.155 5557.61 No binlog 60 4277.955 4169.215 No binlog 70 5328.96 5139.625 No binlog 80 6087.925 5957.015


We can observe the general trend that median throughput on XFS is a little worse than with EXT4, with practically identical variance.

The difference in throughput is minimal. You can use either XFS or EXT4.

Hardware Spec

Supermicro server:

  • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
  • 2 sockets / 28 cores / 56 threads
  • Memory: 256GB of RAM
  • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
  • Filesystem: ext4/xfs
  • Percona-Server-5.7.21-20
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic
Extra Raw Results, Scripts and Config

My goal is to provide fully repeatable benchmarks. To that effect, I’ve shared all the scripts and settings I used in the following GitHub repo:

The post How Binary Logs (and Filesystems) Affect MySQL Performance appeared first on Percona Database Performance Blog.

This Week in Data with Colin Charles 38: Percona Live Europe 2018 and PostgreSQL

May 4, 2018 - 10:35am

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

The week after Percona Live Santa Clara 2018 tends to be much quieter, aided by the fact that I took a few days away during Labor Day. The next thing to look out for is Percona Live Europe 2018, which at this stage is really a note to let you save the dates: November 5-7 2018, at the Radisson Blu, in Frankfurt. There is no call for papers yet, there is no committee, and it is not listed yet at the Percona Live Conferences page. Hang in there! We’ll open the call for papers soon!

Now that Percona is in the PostgreSQL space, it seems prudent that there will also be more PostgreSQL content here. A great resource naturally is Planet PostgreSQL. There also seems to be another resource on The internals of PostgreSQL, and as books go, Mastering PostgreSQL in Application Development sure looks very interesting. Do you have recommended resources?

Releases Link List Upcoming appearances Feedback

I look forward to feedback/tips via e-mail at or on Twitter @bytebot.

The post This Week in Data with Colin Charles 38: Percona Live Europe 2018 and PostgreSQL appeared first on Percona Database Performance Blog.

Percona Live 2018 Community Report

May 4, 2018 - 9:41am

So, after a whirlwind few days, Percona Live 2018 has been and gone. There was a great energy about the conference, and it was fantastic to meet so many open source database enthusiasts and supporters. A few things that I experienced:

  • Your great willingness to share knowledge. It was a fantastic place to learn for those who have experience from a different field of technology. Almost everyone seemed to be very open and generous with their time.
  • The “superstars” from our industry are not so scary. They are as willing to be open and generous with their experience and views as any of the other attendees, and equally as interested in making new discoveries.
  • There aren’t many times you can sit down to a (community) dinner, to share food and anecdotes with people from USA, UK, Germany and Armenia at the same time. I thoroughly enjoyed the company, and wish there were more opportunities for similar encounters. Thanks to Pythian for setting that up.
  • My Percona colleagues are wonderful, committed human beings with more than a passing interest in music – the Percona Sessions have got to happen…
  • That you can run a very long way in a day between the Santa Clara Convention Center and the Hyatt Regency Hotel.

I had very many positive conversations with delegates. You offered any criticisms along with a suggestion of how we should tweak things for the better. Our community is a creative, generous, problem-solving machine, though I shouldn’t be surprised at that.

So, with only a few more duties to complete, I’d like to thank you for your company. For those that did not make it to this year’s event, I hope that you might be persuaded to join us in the future — either at Percona Live Europe 2018 or at Percona Live 2019.

Packt Prizes

Our media sponsor, Packt, generously provided us with three free ebooks and two free instruction videos as prizes for delegates:

  1. Mastering MongoDB 3.x
  2. MySQL 8 Cookbook
  3. MongoDB Administrator’s Guide
  4. Elastic Databases and Data Processing with AWS [Video]
  5. AWS Administration – Database, Networking, and Beyond [Video]

There are another 10 titles for which we can offer delegates a 50% discount: you should have received your emails. Thanks are due again to Packt.

Community Blog

While I have your attention, I’d like to let you know about the forthcoming Percona community blog. Having been some time in the planning, this is starting really soon, and is like a year-round, online, Percona Live. We already have some keen writers for this, but if you would be interested in creating content (whether written, podcast or webcast) for the community blog, then please get in touch. The brief is very wide — as long as your submission is relevant to the open source database community then it would be welcome.

Finally, I would like to invite feedback on how to make the event shine even brighter — please drop me an email if you have suggestions or ideas. Meanwhile, I hope you enjoy these photographs of the MySQL Community Awards Winners, presented at PL18. You can read more about this community initiative.

Perhaps you’ll be able to join us in Frankfurt in November? Time to start thinking about those submissions for the call for papers!

Or perhaps next year at Percona Live Open Source Database Conference in 2019 – wherever it may be!

Photographs: Randy Tunnell Photography

The post Percona Live 2018 Community Report appeared first on Percona Database Performance Blog.

Q&A: “Percona XtraDB Cluster 5.7 and ProxySQL for Your High Availability Needs” Webinar

May 3, 2018 - 4:48pm

On March 22, 2018, we held a webinar on how Percona XtraDB cluster 5.7 (PXC) and ProxySQL can help achieve your database clustering high availability needs. Firstly, thanks to all the attendees for taking time to attend the webinar and we are sure you had a webinar experience. We tried answering some of your high availability questions during the call but due to time restrictions if we missed some of the questions then this blog will help clarify them.

Q. You say the replication to servers is virtually synchronous, if there is any latency, does ProxySQL detect this and select a node accordingly?

A. PXC nodes are virtually synchronous, which effectively means while the apply/commit of a transaction may be in progress on one node, other nodes may have completed applying it. There is no direct way for ProxySQL to know about this, but it could be traced by looking at wsrep_last_applied and wsrep_last_committed. Also, if a user expects to always fetch updated data, then a wsrep_sync_wait configuration can be used.

Q. Hello, do you suggest geoReplication / wan clustering for an e-commerce website? Let ‘s say served by an Italian pxc cluster and served by a US PXC cluster?

A. Geo-distributed PXC is already in use by a lot of customers, and is meant to exactly serve the use-case you have pointed out. An important aspect of geo-distributed clustering (that often gets missed) is to configure timeout and window setting to accommodate network latency and segment settings. There is also a separate webinar on this topic and you can surely get in touch with us to find out more details on how to configure it correctly.

Q. Can we add a read-only node with PXC?

A. You can simply mark the selected nodes as super_read_only (or read_only). Replication continues as normal but direct traffic is blocked.

Q. Does the ProxySQL impact performance?

A. Using all of ProxySQL’s features gives you a huge performance improvement. Here is the sample use case.

Q. I have not had “excellent” results with Drupal. (e.g., clearing cache sometimes causes corruption, although i ensure all tables do have a primary key). Any advice on its suitability? I am currently using proxySQL with a single percona (non-cluster) 5.7 but would like to try again with PXC if advisable.

A. Not sure what exact problem you faced, but you may want to check this variable and articles around it wsrep_drupal_282555_workaround.

Q. Another question (to queue up as you are able to answer if possible): do you recommend SSL between ProxySQL — and specifically, what are the performance impacts, especially if there’s some latency between proxySQL and master percona db for writes?

A. We recommend SSL for security reasons, but it depends on the individual setup. Currently, ProxySQL does not support SSL from frontends. This feature is only available since 2.0.

Q1. Can i put two PXC clusters in master-slave replication mode with automatic failover?

Q2. How can i setup two PXC clusters in master-slave replication model with automatic failover?

A. You can have async master-slave replication link among two PXC clusters, but automatic failover of the node (if the acting master from cluster-1 fails then another active nodes of the cluster takes over as master) is currently not supported.

Q. Can the replication be done from ProxySQL level, so that if one node goes down in slave PXC, another node in PXC will take over the slave role?

A. This feature is not supported through ProxySQL. You can monitor replication lag through ProxySQL.

Q. Suppose if i have 5 node cluster in DC1 & DC2, how can we make transaction successful as soon as nodes in DC1 are committed rather than waiting for certification from nodes in DC2?

A. Given the transaction is executed on the local node and during commit (as a pre-commit stage) it is replicated (replication action doesn’t include certification and commit) to the other nodes of the cluster. Once replicated each node can parallelly certify, apply and commit the transaction. So this effectively means a transaction doesn’t need to be certified on all the nodes of the cluster before communicating commit success to end-user. Once the transaction is replicated, originating node can complete the local commit and communicate success to the application.

Q. Hi, thank you for the webinar is ProxySQL support HA, is it a single point of failure?

A. ProxySQL supports Native Clustering, thereby forming a ProxySQL cluster (vs. a single ProxySQL node) and in turn helps avoid a single point failure.

Q.  What is a good setup you will recommend, make proxySQL on some other server/vm or on the same as one of PXC nodes?

A. We would recommend installing ProxySQL on an independent node (or share with other applications). We don’t recommend installing ProxySQL on a PXC node. If the node hosting PXC and ProxySQL goes down (network or power failure), even though the cluster is working, the application will still lose connectivity as the ProxySQL gateway goes down as well.

Q. Let’s say we have three nodes, good quorum, what happened when one node goes down for maintenance what happens to the quorum since only two nodes now?

A. Two nodes can still form the quorum and continue servicing the workload.

Q If the transaction is not committed to all the nodes then will the cluster remains locked for read too?

A. No. The transaction commit is independent of a read action. “transaction commit” can continue in the background and the user can continue to read from the cluster node. If a user has configured wsrep_sync_wait, which effectively means wait for a transaction to get committed to fetch updated data only, then the read may wait for transaction commit to complete.

Q. Is there a way to do partitioning over data? To not have 100% replicate in each master?

A. PXC/Galera, being a multi-master solution, doesn’t recommend unsync data nodes. As an end-user you can still achieve it by setting wsrep_on=off -> execute a workload (this will not be replicated on the cluster) -> wsrep_on=on (all action post this point will again follow replication). This can lead to data inconsistency, though,  and shutdown of the cluster if the workload or action are not properly segregated – so not-recommended.

Q. Are changes done by triggers rollbackable?

A. Yes, they are.

Q. Does ProxySQL prevent “mysql server gone away” in mostly idle daemons?

A. ProxySQL Monitor Module regularly probes the backend nodes and marks the node as OFFLINE in the ProxySQL database if MySQL server is down.

Q. Can proxysql cache rules use regexes?

A. We can use regex with ProxySQL query rules. Go here for more info.

Q. Can PMM be used in Digitalocean droplets?

A. Yes.

Q. In regards to PXC, how much delay is introduced when data is written since it has to appear on all nodes?

A. When a user initiates a transaction on given node (let’s call it an originating node), then it is first applied (not committed) on the said node and a binary write-set is created. This write-set is then replicated on other nodes of the cluster. Once the replication is successful, each node can independently certify, apply and commit the transaction. Since originating node has already applied the transaction, it just needs to certify and commit the transaction. But it is interesting to note that the apply stage on the other replicated node is fast too, given that the transaction is now packed in a database optimized apply format. In short, there would be no delay (or marginal delay). Delay could be higher if the transaction is a huge transaction, as the apply stage could take time. That is one of the reasons Galera doesn’t recommend huge transactions.

Q. How does PXC (Percona XtraDB Cluster) allow DDL (schema changes) on one server with DML on the same table on another server? (This can break MySQL Master-Master replication)?

A. PXC executes DDL using the TOI (total order isolated) protocol. In short, while DDL is executing it takes complete control of the node (no other parallel DML or DDL is allowed). DDL executes at the same position on all then does.

Q. Can ProxySQL split read-write queries based on stored procedure names (patterns)? e.g. sp_write vs sp_read?

A. ProxySQL read/write split is based on mysql_query_rules and hostgroups. For more info.

Q. Can we use ProxySQL with a single node for the query caching feature? Especially since query cache will be discontinued in MySQL 8?

A. If you configure Query Cache properly, you can cache queries for a single node. 

Q. Must binary logging be enabled for ProxySQL / PXC to work?

A. PXC replicates write-sets. While binary logging is not needed, PXC still needs these write-sets that are generated using binary logging module so PXC can then enable emulation based bin logs for a generation of these write-sets (persistence to disk is not needed). If disk space is not a constraint, we recommend you enable binary logging.

Q. Please define Galera and Percona, as well as the relationship between the two?

A. Galera is replication technology owned and developed by Codership and distributed under GPL license. Percona has adopted the said technology along with its Percona Server for MySQL and build PXC. Percona continues to refresh updates made to Galera and related wsrep-plugin on a regular basis. At the same time, Percona also continues to refresh from Percona-Server for MySQL for related enhancement and bug fixes.

Q. Is the ProxySql Admin tool the script/tool that you mentioned would autodetect your existing PXC or there’s a different script? Trying to know if you need to have the PXC and ProxySQL installed at the same time?

With ProxySQL, do we need to wait for active threads on the PXC to drain before shutting down the PXC?

A. ProxySQL Admin (proxysql-admin) script helps you configure your PXC nodes to ProxySQL database. PXC and ProxySQL should be up and running to initiate proxysql-admin script. For more info.

If you trigger PXC node shutdown proxysql_galera_checker script marks the node as offline in the ProxySQL DB, and new connections aren’t redirected to the offline node.

Q. 1) HAProxy and ProxySQL: which one has the better performance when the number of Clusters is large? Up to 30 Clusters?
2) What´s the better tool to monitoring a large number of clusters and nodes?

A. For PXC, we strongly recommend ProxySQL as it is closely integrated with PXC. HAProxy works with PXC as well, and before ProxySQL we had customers using it. For a quick comparison, you can take a look at following article.

Q. When the PXC settings have a maximum connection how does ProxySQL allow for much more than the standard connections?

A. ProxySQL terminates the connection with a connection timeout error.

FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_insert.lua:61: SQL error, errno = 9001, state = 'HY000': Max connect timeout reached while reaching hostgroup 10 after 10012ms


Once again, thanks for your questions and queries. If you still have more questions or need clarification, you can log them at the percona-xtradb-cluster forum. We would also like to know what else you expect from Percona XtraDB Cluster in upcoming releases.

The post Q&A: “Percona XtraDB Cluster 5.7 and ProxySQL for Your High Availability Needs” Webinar appeared first on Percona Database Performance Blog.

Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

May 3, 2018 - 12:09pm

Recently I worked on one customer issue that I would describe as “slave performance too slow”. During a quick analysis, I’ve found that the replication slave SQL thread cannot keep up while processing row-based events from the master’s binary log.

For example:

mysql> SHOW SLAVE STATUSG *************************** 1. row *************************** ... Master_Log_File: binlog.0000185 Read_Master_Log_Pos: 86698585 ... Relay_Master_Log_File: binlog.0000185 Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Exec_Master_Log_Pos: 380 Relay_Log_Space: 85699128 ... Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c ... Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057 Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2, 98974e7f-2fbc-18e9-72cd-07003817585c:1-1056 ...

The processlist state for the SQL thread can be one of the following: Reading event from the relay log, or System lock, or potentially some other state. In my case:

mysql> SHOW PROCESSLIST; +----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+ ... | 4 | system user | | NULL | Connect | 268 | Reading event from the relay log | NULL | ... +----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

What causes that?

Let’s take a look what could potentially cause such behavior and what we need to pay attention to. When the SQL thread applies the change from a row-based event, it has to locate the exact row that was updated. With a primary key, this is trivial as only one row can possibly have the same value for the primary key.

However, if there is no primary key on the table on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete. It repeats the search for each updated row. This search is both very resource usage intensive (CPU usage can be up to 100%) and slow causing the slave to fall behind.

For InnoDB tables, the “hidden” key used for the clustered index for tables without a primary key cannot be used to avoid searching the entire table for the rows to update or delete. We need to keep in mind that the “hidden” key is unique only to each MySQL instance, so the replication master and replication slave generally don’t have the same values for the “hidden” key for the same row. What can we do to solve that?

The best solution is to ensure that all tables have a primary key. This not only ensures the SQL thread can easily locate rows to update or delete, but it is also considered as a best practice since it ensures all rows are unique.

If there is no way to logically add a natural primary key for the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.

The query below helps you to locate tables without a primary key:

SELECT tables.table_schema, tables.table_name, tables.table_rows FROM information_schema.tables LEFT JOIN ( SELECT table_schema, table_name FROM information_schema.statistics GROUP BY table_schema, table_name, index_name HAVING SUM( CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END ) = COUNT(*) ) puks ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name WHERE puks.table_name IS NULL AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys') AND tables.table_type = 'BASE TABLE' AND engine='InnoDB';

Please note that for InnoDB, there must always be a unique NOT NULL key for all tables. It is required for the clustered index. So adding an explicit “dummy” column as suggested above will not add to the overall storage requirements as it will merely replace the hidden key.

It’s not always possible to add a primary key to the table immediately if, for example, there are many relations on the application side/legacy system, lack of resources, unknown application behavior after the change which required testing, etc.

In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.

The search algorithm is set using the slave_rows_search_algorithms option which is available in MySQL 5.6 and later. The default value is to use an index scan if possible, otherwise a table scan.

However, for tables without a primary key using a hash scan, which causes the SQL thread to temporarily cache hashes to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically using:

mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';
Just to note INDEX_SCAN,HASH_SCAN is the default value in MySQL 8.0.

One thing to be aware of when using hash scans that the hashes are only reused within one row-based event. (Each row-based event may have changes to several rows in the same table originating from the same SQL statement).

The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event. The default max event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when:

  1. Several rows fit into one row based event. It may help to increase the value of binlog_row_event_max_size on the replication master, if you perform updates or deletes on large rows (e.g., with blob or text data). You can only set the binlog_row_event_max_size in the MySQL configuration file, and resetting this value requires a restart.
  2. One statement changes several rows.


Even if enabling hash scans improves the performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be the general rule of thumb in the schema design in order avoid and/or minimize many issues like slave performance too slow (as described in this post).

Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.

The post Causes and Workarounds for Slave Performance Too Slow with Row-Based Events appeared first on Percona Database Performance Blog.

ProxySQL Query Rewrite Use Case

May 2, 2018 - 4:02pm

In this blog post, I’m going to revisit the ProxySQL Query Rewrite feature. You may have seen me talking about possible use case scenarios in the past few conferences, but the reason I’m starting with this is that query rewriting was the original intention for building ProxySQL.

Why would you need to rewrite a query?

  • You’ve identified a query that’s causing bottleneck or slowness
  • A special operation requires query routing
  • You cannot modify application code

So here we have a case of a bad query hitting the backend database. You as a DBA have identified the query as causing severe slowdown, which could lead to a site-wide outage. This query needs to be optimized, and you have asked the developer to correct this bad query. Their answer isn’t really what you expected. You can rewrite some queries to have the same data result by choosing a different optimizer path. In cases where an application was written in ORM – such as Hibernate or similar – it is not easy to quickly make a code change.

The query rewrite feature of ProxySQL makes this possible (until the application can be modified).

How do we rewrite a query? There are two ways to accomplish this with ProxySQL.

Query rewrite is just a match_pattern + replace_pattern activity, whereas match_digest is only used for matching a query, not rewriting it. Logically, match_digest serves the same purpose of username, schemaname, proxy_addr, etc. It only matches the query.

These two different mechanisms offers ways to optimize query matching operation efficiently depending on the query type (such as DML operation versus SELECT query). Please note that if your intention is to rewrite queries, the rule must match the original query by using match_pattern. Query rules are processed by using rule_id field and only applied if active = 1.

Here’s how we can demonstrate match_digest in our test lab:

mysql> SELECT hostgroup hg, sum_time, count_star, digest_text FROM stats_mysql_query_digest ORDER BY sum_time DESC limit 10; +----+-----------+------------+-----------------------------------+ | hg | sum_time | count_star | digest_text | +----+-----------+------------+-----------------------------------+ | 0 | 243549572 | 85710 | SELECT c FROM sbtest10 WHERE id=? | | 0 | 146324255 | 42856 | COMMIT | | 0 | 126643488 | 44310 | SELECT c FROM sbtest7 WHERE id=? | | 0 | 126517140 | 42927 | BEGIN | | 0 | 123797307 | 43820 | SELECT c FROM sbtest1 WHERE id=? | | 0 | 123345775 | 43460 | SELECT c FROM sbtest6 WHERE id=? | | 0 | 122121030 | 43010 | SELECT c FROM sbtest9 WHERE id=? | | 0 | 121245265 | 42400 | SELECT c FROM sbtest8 WHERE id=? | | 0 | 120554811 | 42520 | SELECT c FROM sbtest3 WHERE id=? | | 0 | 119244143 | 42070 | SELECT c FROM sbtest5 WHERE id=? | +----+-----------+------------+-----------------------------------+ 10 rows in set (0.00 sec) mysql> INSERT INTO mysql_query_rules (rule_id,active,username,match_digest, match_pattern,replace_pattern,apply) VALUES (10,1,'root','SELECT.*WHERE id=?','sbtest2','sbtest10',1); Query OK, 1 row affected (0.00 sec) mysql> LOAD MYSQL QUERY RULES TO RUNTIME; Query OK, 0 rows affected (0.00 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | 0 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 1 | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ 1 row in set (0.00 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | 593 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 1 | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ 1 row in set (0.00 sec)

We can also monitor Query Rules activity live using the ProxyTop utility:

To reset ProxySQL’s statistics for query rules, use following steps:

mysql> SELECT 1 FROM stats_mysql_query_digest_reset LIMIT 1; +---+ | 1 | +---+ | 1 | +---+ 1 row in set (0.01 sec) mysql> LOAD MYSQL QUERY RULES TO RUNTIME; Query OK, 0 rows affected (0.00 sec)

Here’s a match_pattern example:

mysql> SELECT hostgroup hg, sum_time, count_star, digest_text FROM stats_mysql_query_digest ORDER BY sum_time DESC limit 5; +----+----------+------------+----------------------------------+ | hg | sum_time | count_star | digest_text | +----+----------+------------+----------------------------------+ | 0 | 98753983 | 16292 | BEGIN | | 0 | 84613532 | 16232 | COMMIT | | 1 | 49327292 | 16556 | SELECT c FROM sbtest3 WHERE id=? | | 1 | 49027118 | 16706 | SELECT c FROM sbtest2 WHERE id=? | | 1 | 48095847 | 16396 | SELECT c FROM sbtest4 WHERE id=? | +----+----------+------------+----------------------------------+ 5 rows in set (0.01 sec) mysql> INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply) VALUES (20,1,'root','DISTINCT(.*)ORDER BY c','DISTINCT1',1); Query OK, 1 row affected (0.00 sec) mysql> LOAD MYSQL QUERY RULES TO RUNTIME; Query OK, 0 rows affected (0.01 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ | 0 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 1 | | 0 | 20 | NULL | 1 | root | NULL | DISTINCT(.*)ORDER BY c | DISTINCT1 | NULL | 1 | +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ 2 rows in set (0.01 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ | 9994 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 1 | | 6487 | 20 | NULL | 1 | root | NULL | DISTINCT(.*)ORDER BY c | DISTINCT1 | NULL | 1 | +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ 2 rows in set (0.00 sec) mysql> SELECT 1 FROM stats_mysql_query_digest_reset LIMIT 1; +---+ | 1 | +---+ | 1 | +---+ 1 row in set (0.00 sec) mysql> LOAD MYSQL QUERY RULES TO RUNTIME; Query OK, 0 rows affected (0.00 sec)

The key in query ruling for a rewrite is the order of the apply field:

  • apply = 1 means don’t evaluate any other rules if there’s a match already.
  • apply = 0 means evaluate the next rules in the chain.

As we can see in the test below, all queries matching with rule_id = 10 or rule_id = 20 have hits. In reality, all rules in runtime_mysql_query_rules are active. If we want to disable a rule that is in the mysql_query_rules table, set active = 0:

mysql> update mysql_query_rules set apply = 1 where rule_id in (10); Query OK, 1 row affected (0.00 sec) mysql> update mysql_query_rules set apply = 0 where rule_id in (20); Query OK, 1 row affected (0.00 sec) mysql> LOAD MYSQL QUERY RULES TO RUNTIME; Query OK, 0 rows affected (0.00 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ | 0 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 1 | | 0 | 20 | NULL | 1 | root | NULL | DISTINCT(.*)ORDER BY c | DISTINCT1 | NULL | 0 | +------+---------+--------+--------+----------+--------------------+------------------------+-----------------+-----------+-------+ 2 rows in set (0.00 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, flagIN, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +-------+---------+--------+--------+----------+--------------------+------------------------+-----------------+--------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | flagIN | apply | +-------+---------+--------+--------+----------+--------------------+------------------------+-----------------+--------+-------+ | 10195 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | 0 | 1 | | 6599 | 20 | NULL | 1 | root | NULL | DISTINCT(.*)ORDER BY c | DISTINCT1 | 0 | 0 | +-------+---------+--------+--------+----------+--------------------+------------------------+-----------------+--------+-------+ 2 rows in set (0.00 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, flagIN, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +-------+---------+--------+--------+----------+--------------------+------------------------+-----------------+--------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | flagIN | apply | +-------+---------+--------+--------+----------+--------------------+------------------------+-----------------+--------+-------+ | 20217 | 5 | NULL | 1 | root | NULL | DISTINCT(.*)ORDER BY c | DISTINCT1 | 0 | 1 | | 27020 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | 0 | 0 | +-------+---------+--------+--------+----------+--------------------+------------------------+-----------------+--------+-------+ 2 rows in set (0.00 sec) mysql> update mysql_query_rules set active = 0 where rule_id = 5; Query OK, 1 row affected (0.00 sec) mysql> LOAD MYSQL QUERY RULES TO RUNTIME; Query OK, 0 rows affected (0.02 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | 0 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 0 | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ 1 row in set (0.00 sec) mysql> SELECT hits, mysql_query_rules.rule_id,digest,active,username, match_digest, match_pattern, replace_pattern, cache_ttl, apply FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id; +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | hits | rule_id | digest | active | username | match_digest | match_pattern | replace_pattern | cache_ttl | apply | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ | 4224 | 10 | NULL | 1 | root | SELECT.*WHERE id=? | sbtest2 | sbtest10 | NULL | 0 | +------+---------+--------+--------+----------+--------------------+---------------+-----------------+-----------+-------+ 1 row in set (0.01 sec)

Additionally, ProxySQL can help to identify bad queries. Login to the admin module and follow these steps:

Find the most time-consuming queries:

mysql> SELECT SUM(sum_time), SUM(count_star), digest_text FROM stats_mysql_query_digest GROUP BY digest ORDER BY SUM(sum_time) DESC LIMIT 3G *************************** 1. row *************************** SUM(sum_time): 95053795 SUM(count_star): 13164 digest_text: BEGIN *************************** 2. row *************************** SUM(sum_time): 85094367 SUM(count_star): 13130 digest_text: COMMIT *************************** 3. row *************************** SUM(sum_time): 52110099 SUM(count_star): 13806 digest_text: SELECT c FROM sbtest3 WHERE id=? 3 rows in set (0.00 sec)

Find highest average execution time:

mysql> SELECT SUM(sum_time), SUM(count_star), SUM(sum_time)/SUM(count_star) avg, digest_text FROM stats_mysql_query_digest GROUP BY digest ORDER BY SUM(sum_time)/SUM(count_star) DESC limit 1; +---------------+-----------------+--------+--------------------------------+ | SUM(sum_time) | SUM(count_star) | avg | digest_text | +---------------+-----------------+--------+--------------------------------+ | 972162 | 1 | 972162 | CREATE INDEX k_5 ON sbtest5(k) | +---------------+-----------------+--------+--------------------------------+ 1 row in set (0.00 sec)

The above information can also be gathered from information_schema.events_statements_summary_by_digest, but I prefer the ProxySQL admin interface. Also, you can run the slow query log analysis by running a detailed pt-query-digest on your system to identify slow queries. You can also use PMM’s QAN.


I’ve found the best documentation on ProxySQL query rewrite is at IBM’s site, where they explain query rewrite fundamentals with examples. It’s worth a read. I’m not going to get into the details of these techniques here, but if you find more relevant resources, please post them in the comments section.

A few of the possible query optimization techniques:

  • Operation merging
  • Operation movement
  • Predicate translation

At the time of this blog post, ProxySQL has also announced a new fast schema routing algorithm to support thousands of shards.

There may be other cases where you want to divert traffic to another table. Think of a table hitting the maximum integer value, and you want to keep inserts going into a new table while you alter the old one to correct the issue. In the mean time, all selects can still point to the old table to continue operation.

As of MySQL 5.7.6, Oracle also offers query rewrite as a plugin, and you can find the documentation here. The biggest disadvantage of using Oracle’s built-in solution is the rewrite rule sits with the server it is implemented on. That’s where ProxySQL has a bigger advantage: it sits between the application and database server, so the rule applies to the entire topology, not just for a single host.

As you can see, ProxySQL query rewrite is a great way to solve some real operational issues and make you a hero to the team and project. To become a rock star, you might want to consider Percona Training on ProxySQL. The training will provide the knowledge to set up a ProxySQL environment with best practices, understand when and how to change the configuration, and maintain it to ensure increasing your uptime SLAs. Contact us for more details at


The post ProxySQL Query Rewrite Use Case appeared first on Percona Database Performance Blog.

MongoDB Rollback in replicaset

May 1, 2018 - 3:30pm

In this blog post, we’ll look at how MongoDB rollback works during replicaset failovers.

In recent versions, MongoDB has provided lots of features related to replicaset and automatic failover. When it comes to failover, the next question that arises is “How does MongoDB ROLLBACK work during replicaset failover?”

If a PRIMARY member (say node A) stepped down with some data writes that were executed but not replicated to the SECONDARY members yet, then a ROLLBACK occurs on the former PRIMARY A when it rejoins the replicaset. I’ll explain below how the ROLLBACK works!

ROLLBACK Scenario:

ROLLBACK is rare in a replicaset as MongoDB tries to avoid it by replicating the operations from PRIMARY to SECONDARY without delay, under normal conditions. Most of the time ROLLBACK occurs in the event of network partitioning, or if SECONDARY members can’t keep up with the throughput of operations on the former PRIMARY.


We will see the process with a test. I have used Docker for this test with the MongoDB 3.2 Jessie version to setup a replicaset with members mongo1 – A, mongo2 – B, mongo3 – C and set Priority 10 to A. Now A is PRIMARY as expected in the replicaset. We need to write some data into A and create a network partition scenario with B and C at the same time. For that, I inserted 25000 documents into A and made it out of network at the same time.

Terminal 1 (A’s mongo prompt):

my-mongo-set:PRIMARY> for (var i = 1; i <= 25000; i++) { ...    db.testData.insert( { x : i } ) ... } WriteResult({ "nInserted" : 1 }) my-mongo-set:PRIMARY> db.testD2018-03-30T17:34:51.455+0530 I NETWORK  [thread1] trying reconnect to ( failed 2018-03-30T17:34:51.464+0530 I NETWORK  [thread1] reconnect ( ok                       db.testD admin.testD my-mongo-set:SECONDARY> rs.slaveOk() my-mongo-set:SECONDARY> db.testData.count() 25000


Vinodhs-MBP:~ vinodhkrish$ docker ps CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                           NAMES b27d82ac2439        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days >27017/tcp        mongo3 2b39f9e41973        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days >27017/tcp        mongo2 105b6df757d7        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days >27017/tcp        mongo1 Vinodhs-MBP:~ vinodhkrish$ docker network disconnect my-mongo-cluster mongo1

The member A has now become as SECONDARY, because it couldn’t reach other members in the replicaset. On the other side, B and C members see that A is not reachable and then B is elected as PRIMARY. We could see that some inserts from former A replicated to B before the network split happens.

(B node)

my-mongo-set:PRIMARY> db.testData.count() 15003

Now do some write operations in current PRIMARY – B and then let node A join the network back by joining the container back to the bridge network. You can observe below that the node A’s member states are changing in the mongo prompt. (I just connected to A and pressed ENTER/RETURN button many times to see the member states, or you can see them in the log file):

(A node)

Vinodhs-MacBook-Pro:mongodb-osx-x86_64-3.2.19 vinodhkrish$ ./bin/mongo MongoDB shell version: 3.2.19 connecting to: my-mongo-set:ROLLBACK>  my-mongo-set:RECOVERING>  my-mongo-set:SECONDARY>  my-mongo-set:SECONDARY>  my-mongo-set:PRIMARY>


From MongoDB point of view, we will see the replicaset process to understand what happened above. Normally the SECONDARY member syncs the oplog entries from its syncSource (the member from where the data is replicated) by using oplogFetcher. The OplogFetcher first sends a find() command to the syncSource’s oplog, and then follows with a series of getMores on the cursor. When node A rejoins the replicaset, node A’s oplogFetcher first sends find() command to syncSource node B and check it has a greater than or equal predicate on the timestamp of the last oplog entry it has fetched. Usually the find() command should at least return one doc due to the greater than or equal predicate. If not, it means that the syncSource is behind and so it will not replicate from it and look for other syncSource.

In this case, A’s oplogFetcher sees that the first document returned from node B does not match the last entry in its oplog. That means node A’s oplog has diverged from node B’s and it should go into ROLLBACK.

Node A first finds the common point between its oplog and its syncSource B’s oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them. Here, 9997 inserts are missed from B and C nodes, and so these documents will be recovered from A’s oplog.

2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 30 12:03:52:139). source's GTE: (term: 5, timestamp: Mar 30 12:05:37:1) hashes: (3789163619674410187/3226093795606474294) 2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] rollback 0 2018-03-30T12:08:37.160+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK 2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] beginning rollback 2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] rollback 1 2018-03-30T12:08:37.164+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint 2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Mar 30 12:03:52:139 2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback their last optime: Mar 30 12:08:17:1c5 2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -265 seconds 2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback common point is (term: 4, timestamp: Mar 30 12:03:46:d2) 2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback 3 fixup 2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] rollback 3.5 2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] Setting minvalid to (term: 5, timestamp: Mar 30 12:08:17:1c5) 2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4 n:1 2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.6 2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.7 2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 5 d:9997 u:0 2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 6 2018-03-30T12:08:38.394+0000 I REPL     [rsBackgroundSync] rollback done 2018-03-30T12:08:38.396+0000 I REPL     [rsBackgroundSync] rollback finished


Where would these 9997 recovered documents go? MongoDB writes these ROLLBACK documents under the rollback directory in the dbpath. These recovered collections are named with namespace as the prefix and the date time as the suffix in their names. These are in BSON format, and we need to convert into JSON to analyze them so the plan for the next course of action can be done. In our case, the testData collection’s rollback data are as follows:

root@105b6df757d7:/# cd /data/db root@105b6df757d7:/data/db# ls -l rollback/ total 324K -rw-r--r-- 1 mongodb mongodb 323K Mar 30 12:08 admin.testData.2018-03-30T12-08-38.0.bson

root@105b6df757d7:/data/db/rollback# bsondump admin.testData.2018-03-30T12-08-38.0.bson > rollback.json 2018-03-30T12:13:00.033+0000 9997 objects found root@105b6df757d7:/data/db/rollback# head rollback.json {"_id":{"$oid":"5abe279f97044083811b5975"},"x":15004.0} {"_id":{"$oid":"5abe279f97044083811b5976"},"x":15005.0} {"_id":{"$oid":"5abe279f97044083811b5977"},"x":15006.0} {"_id":{"$oid":"5abe279f97044083811b5978"},"x":15007.0} {"_id":{"$oid":"5abe279f97044083811b5979"},"x":15008.0} {"_id":{"$oid":"5abe279f97044083811b5980"},"x":15009.0} {"_id":{"$oid":"5abe279f97044083811b5981"},"x":15010.0}

That’s it? Now check the counts of the testData collection in node A:

my-mongo-set:PRIMARY> db.testData.count() 15003

So the records 9997 which were rollbacked into the rollback directory would also be dropped from the collection. This ensures the data consistency throughout the replicaset.

How to avoid ROLLBACK – writeConcern

The default writeConcern in the replicaSet is w:1., i.e., When a client writes into a replicaSet, then it receives an acknowledgment from the PRIMARY alone and won’t wait for SECONDARY members’ acknowledgment. If you want to avoid the ROLLBACK scenario in your environment, then you have to use the {w:majority} or {w:n}, where 1 > n <=  (no. of members in your replica set). This ensures that the writes are propagated to so many members of the replica set before sending the acknowledgment to the client. This solves the problem of ROLLBACK.

But please be careful that you are not giving higher value to writeConcern, because it also affects the write performance. The acknowledgment needs to be received from the number of members mentioned in the value. The value {w:majority} provides the acknowledgement that write operations have propagated to the majority of voting nodes, including the primary and is suitable for most of the environments.

ROLLBACK – Limitation

The main thing to note here is that mongod will not rollback more than 300MB data. In such cases, we need to manually check the instance to recover the data. You can see the below message in mongod.log in such cases:

[replica set sync] replSet syncThread: 13410 replSet too much data to roll back

Understanding this simple ROLLBACK background helps us to decide what needs to be done with the rollbacked data. It also helps us avoid such scenarios, because data is data and is very important!

The post MongoDB Rollback in replicaset appeared first on Percona Database Performance Blog.

Webinar Thursday May 3, 2018: Running MongoDB in Production (Part 3)

May 1, 2018 - 1:29pm

Please join Percona’s Senior Technical Operations Architect, Tim Vaillancourt as he presents Running MongoDB in Production (Part 3) on Thursday, May 3, 2018, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

Are you a seasoned MySQL DBA that needs to add MongoDB to your skills? Are you used to managing a small environment that runs well, but want to know what you might not know yet?

MongoDB works well, but when it has issues the number one question is “where should I go to solve a problem?”

This webinar on running MongoDB covers:

  • Troubleshooting
    • Log File
    • Slow Query
    • Operations
  • Schema Design
    • Data Types
    • Indexes
    • Workflows
  • Data Integrity
    • Replica Sets
    • Write Concerns
    • Data Recovery
  • Scaling (Read/Writes)

Register for the webinar now.

Missed Part 1 and Part 2 of our Running MongoDB in Production series? You can watch and download the slides of Part 1 here and watch or download the slides of Part 2 here.

Timothy Vaillancourt, Senior Technical Operations Architect

Tim joined Percona in 2016 as Sr. Technical Operations Architect for MongoDB with the goal of making MongoDB operations as smooth as possible. With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming, combined with experience tuning systems from the hard disk all the way up to the end-user, Tim has spent time in nearly every area of the modern IT stack with many lessons learned.

Tim is based in Amsterdam, NL and enjoys traveling, coding and music. Prior to Percona Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (“Battlefield” series, “Mirrors Edge” series, “Star Wars: Battlefront”) launch and operate smoothly while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS. Prior to moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.

The post Webinar Thursday May 3, 2018: Running MongoDB in Production (Part 3) appeared first on Percona Database Performance Blog.

ClickHouse Meetup in Salt Lake City

May 1, 2018 - 12:07pm

Join Percona CTO Vadim Tkachenko at the Cloud Native Utah meetup in Salt Lake City on Tuesday, May 8, 2018, for an Intro to ClickHouse.

Next week, I’ll be switching from MyRocks performance testing and present an introduction to ClickHouse to the Cloud Native Utah meetup.

Interestingly enough, even though it is totally different from OLTP engines, ClickHouse uses a MergeTree engine. MergeTree engines have a lot of similarities with Log Structured Merge Tree (which is what is used by MyRocks / RocksDB). This the structure is optimized to run on huge datasets / low memory scenarios.

PingCAP TiDB and CockroachDB – the new databases on the block – are using RocksDB as the main storage engine. So is Log Structured Merge Tree the future of databases?

We can talk about this and other questions next week in Salt Lake City. If you are in town please join us at the Cloud Native Utah meetup.


The post ClickHouse Meetup in Salt Lake City appeared first on Percona Database Performance Blog.

A Look at MyRocks Performance

April 30, 2018 - 4:35pm

In this blog post, I’ll look at MyRocks performance through some benchmark testing.

As the MyRocks storage engine (based on the RocksDB key-value store ) is now available as part of Percona Server for MySQL 5.7, I wanted to take a look at how it performs on a relatively high-end server and SSD storage. I wanted to check how it performs for different amounts of available memory for the given database size. This is similar to the benchmark I published a while ago for InnoDB (

In this case, I plan to use a sysbench-tpcc benchmark ( and I will execute it for both MyRocks and InnoDB. We’ll use InnoDB as a baseline.

For the benchmark, I will use 100 TPC-C warehouses, with a set of 10 tables (to shift the bottleneck from row contention). This should give roughly 90GB of data size (when loaded into InnoDB) and is a roughly equivalent to 1000 warehouses data size.

To vary the memory size, I will change innodb_buffer_pool_size from 5GB to 100GB for InnoDB, and rocksdb_block_cache_size for MyRocks.

For MyRocks we will use LZ4 as the default compression on disk. The data size in the MyRocks storage engine is 21GB. Interesting to note, that in MyRocks uncompressed size is 70GB on the storage.

For both engines, I did not use FOREIGN KEYS, as MyRocks does not support it at the moment.

MyRocks does not support SELECT .. FOR UPDATE statements in REPEATABLE-READ mode in the Percona Server for MySQL implementation. However, “SELECT .. FOR UPDATE” is used in this benchmark. So I had to use READ-COMMITTED mode, which is supported.

The most important setting I used was to enable binary logs, for the following reasons:

  1. Any serious production uses binary logs
  2. With disabled binary logs, MyRocks is affected by a suboptimal transaction coordinator

I used the following settings for binary logs:

  • binlog_format = ‘ROW’
  • binlog_row_image=minimal
  • sync_binlog=10000 (I am not using 0, as this causes serious stalls during binary log rotations, when the  content of binary log is flushed to storage all at once)

While I am not a full expert in MyRocks tuning yet, I used recommendations from this page: The Facebook-MyRocks engineering team also provided me input on the best settings for MyRocks.

Let’s review the results for different memory sizes.

This first chart shows throughput jitter. This helps to understand the distribution of throughput results. Throughput is measured every 1 second, and on the chart I show all measurements after 2000 seconds of a run (the total length of each run is 3600 seconds). So I show the last 1600 seconds of each run (to remove warm-up phases):

To better quantify results, let’s take a look at them on a boxplot. The quickest way to understand boxplots is to take a look at the middle line. It represents a median of measurements (see more at

Before we jump to the summary of results, let’s take a look at a variation of the throughput for both InnoDB and MyRocks. We will zoom to a 1-second resolution chart for 100 GB of allocated memory:

We can see that there is a lot of variation with periodical 1-second performance drops with MyRocks. At this moment, I do not know what causes these drops.

So let’s take a look at the average throughput for each engine for different memory settings (the results are in tps, and more is better):

Memory, GB InnoDB MyRocks 5 849.0664 4205.714 10 1321.9 4298.217 20 1808.236 4333.424 30 2275.403 4394.413 40 2968.101 4459.578 50 3867.625 4503.215 60 4756.551 4571.163 70 5527.853 4576.867 80 5984.642 4616.538 90 5949.249 4620.87 100 5961.2 4599.143


This is where MyRocks behaves differently from InnoDB. InnoDB benefits greatly from additional memory, up to the size of working dataset. After that, there is no reason to add more memory.

At the same time, interestingly MyRocks does not benefit much from additional memory.

Basically, MyRocks performs as expected for a write-optimized engine. You can refer to my article How Three Fundamental Data Structures Impact Storage and Retrieval for more details. 

In conclusion, InnoDB performs better (compared to itself) when the working dataset fits (or almost fits) into available memory, while MyRocks can operate (and outperform InnoDB) on small memory sizes.

IO and CPU usage

It is worth looking at resource utilization for each engine. I took vmstat measurements for each run so that we can analyze IO and CPU usage.

First, let’s review writes per second (in KB/sec). Please keep in mind that these writes include binary log writes too, not just writes from the storage engine.

Memory, GB InnoDB MyRocks 5 244754.4 87401.54 10 290602.5 89874.55 20 311726 93387.05 30 313851.7 93429.92 40 316890.6 94044.94 50 318404.5 96602.42 60 276341.5 94898.08 70 217726.9 97015.82 80 184805.3 96231.51 90 187185.1 96193.6 100 184867.5 97998.26


We can also calculate how many writes per transaction each storage engine performs:

This chart shows the essential difference between InnoDB and MyRocks. MyRocks, being a write-optimized engine, uses a constant amount of writes per transaction.

For InnoDB, the amount of writes greatly depends on the memory size. The less memory we have, the more writes it has to perform.

What about reads?

The following table shows reads in KB per second.

Memory, GB InnoDB MyRocks 5 218343.1 171957.77 10 171634.7 146229.82 20 148395.3 125007.81 30 146829.1 110106.87 40 144707 97887.6 50 132858.1 87035.38 60 98371.2 77562.45 70 42532.15 71830.09 80 3479.852 66702.02 90 3811.371 64240.41 100 1998.137 62894.54


We can translate this to the number of reads per transaction:

This shows MyRocks’ read-amplification. The allocation of more memory helps to decrease IO reads, but not as much as for InnoDB.

CPU usage

Let’s also review CPU usage for each storage engine. Let’s start with InnoDB:

The chart shows that for 5GB memory size, InnoDB spends most of its time in IO waits (green area), and the CPU usage (blue area) increases with more memory.

This is the same chart for MyRocks:

In tabular form:

Memory, GB engine us sys wa id 5 InnoDB 8 2 57 33 5 MyRocks 56 11 18 15 10 InnoDB 12 3 57 28 10 MyRocks 57 11 18 13 20 InnoDB 16 4 55 25 20 MyRocks 58 11 19 11 30 InnoDB 20 5 50 25 30 MyRocks 59 11 19 10 40 InnoDB 26 7 44 24 40 MyRocks 60 11 20 9 50 InnoDB 35 8 38 19 50 MyRocks 60 11 21 7 60 InnoDB 43 10 36 10 60 MyRocks 61 11 22 6 70 InnoDB 51 12 34 4 70 MyRocks 61 11 23 5 80 InnoDB 55 12 31 1 80 MyRocks 61 11 23 5 90 InnoDB 55 12 32 1 90 MyRocks 61 11 23 4 100 InnoDB 55 12 32 1 100 MyRocks 61 11 24 4


We can see that MyRocks uses a lot of CPU (in us+sys state) no matter how much memory is allocated. This leads to the conclusion that MyRocks performance is limited more by CPU performance than by available memory.

MyRocks directory size

As MyRocks writes all changes and compacts SST files down the road, it would be interesting to see how the data directory size changes during the benchmark so we can estimate our storage needs. Here is a chart of datadirectory size:

We can see that datadirectory goes from 20GB at the start, to 31GB during the benchmark. It is interesting to observe the data growing until compaction shrinks it.


In conclusion, I can say that MyRocks performance increases as the ratio of dataset size to memory increases, outperforming InnoDB by almost five times in the case of 5GB memory allocation. Throughput variation is something to be concerned about, but I hope this gets improved in the future.

MyRocks does not require a lot of memory and shows constant write IO, while using most of the CPU resources.

I think this potentially makes MyRocks a great choice for cloud database instances, where both memory and IO can cost a lot. MyRocks deployments would make it cheaper to deploy in the cloud.

I will follow up with further cloud-oriented benchmarks.

Extras Raw results, scripts and config

My goal is to provide fully repeatable benchmarks. To this end, I’m  sharing all the scripts and settings I used in the following GitHub repo:

MyRocks settings

rocksdb_max_open_files=-1 rocksdb_max_background_jobs=8 rocksdb_max_total_wal_size=4G rocksdb_block_size=16384 rocksdb_table_cache_numshardbits=6 # rate limiter rocksdb_bytes_per_sync=16777216 rocksdb_wal_bytes_per_sync=4194304 rocksdb_compaction_sequential_deletes_count_sd=1 rocksdb_compaction_sequential_deletes=199999 rocksdb_compaction_sequential_deletes_window=200000 rocksdb_default_cf_options="write_buffer_size=256m;target_file_size_base=32m;max_bytes_for_level_base=512m;max_write_buffer_number=4;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=20;level0_stop_writes_trigger=30;max_write_buffer_number=4;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=0};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;memtable_prefix_bloom_size_ratio=0.05;prefix_extractor=capped:12;compaction_pri=kMinOverlappingRatio;compression=kLZ4Compression;bottommost_compression=kLZ4Compression;compression_opts=-14:4:0" rocksdb_max_subcompactions=4 rocksdb_compaction_readahead_size=16m rocksdb_use_direct_reads=ON rocksdb_use_direct_io_for_flush_and_compaction=ON

InnoDB settings

# files  innodb_file_per_table  innodb_log_file_size=15G  innodb_log_files_in_group=2  innodb_open_files=4000 # buffers  innodb_buffer_pool_size= 200G  innodb_buffer_pool_instances=8  innodb_log_buffer_size=64M # tune  innodb_doublewrite= 1  innodb_support_xa=0  innodb_thread_concurrency=0  innodb_flush_log_at_trx_commit= 1  innodb_flush_method=O_DIRECT_NO_FSYNC  innodb_max_dirty_pages_pct=90  innodb_max_dirty_pages_pct_lwm=10  innodb_lru_scan_depth=1024  innodb_page_cleaners=4  join_buffer_size=256K  sort_buffer_size=256K  innodb_use_native_aio=1  innodb_stats_persistent = 1  #innodb_spin_wait_delay=96 # perf special  innodb_adaptive_flushing = 1  innodb_flush_neighbors = 0  innodb_read_io_threads = 4  innodb_write_io_threads = 2  innodb_io_capacity=2000  innodb_io_capacity_max=4000  innodb_purge_threads=4  innodb_adaptive_hash_index=1

Hardware spec

Supermicro server:

  • CPU:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
  • Memory: 256GB of RAM
  • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
  • Filesystem: ext4
  • Percona-Server-5.7.21-20
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic

The post A Look at MyRocks Performance appeared first on Percona Database Performance Blog.

Keep Sensitive Data Secure in a Replication Setup

April 30, 2018 - 2:47pm

This blog post describes how to keep sensitive data secure on slave servers in a MySQL async replication setup.

Almost every web application has a sensitive data: passwords, SNN, credit cards, emails, etc. Splitting the database to secure and “public” parts allows for restricting user and application parts access to sensitive data.

Field encryption

This is based on MySQL encryption functions or on client-side encryption when the authorized user knows a secret, but encrypted data is distributed to all slaves.

  • If possible, use hashes with a big enough salt, and do not store real sensitive data in the database. A good example is passwords. An end-user sends the login and password, application/SQL code calculates the hash with a salt value unique for each end-user and compares the hash with the value stored in the database. Even if the attacker gets the hashes, it’s still hard or even impossible to extract real passwords for all users. Make sure that you are using a good random number generator for the salt, application-side secret, and a good hash function (not MD5).
  • Encryption is not suitable if you are going to provide public access to your database (via slave dumps in sql/csv/xml/json format).
  • Encryption is a complex topic. Check here for a good blog post explaining hashing usage, and try to find a security consultant if you are inventing some “new” method of storing and encrypting data.
Field encryption example

I’m using a single server setup, because the most important part of data separation should be done on the application side. The secure part of the application has a secret passphrase. For example, you can place the code working with authentication, full profile and payments on a separate server and use a dedicated MySQL account.

create database encrypted; use encrypted; create table t(c1 int, c2 varchar(255), rnd_pad varbinary(16), primary key(c1)); SET block_encryption_mode = 'aes-256-cbc'; SET @key_str = SHA2('My secret passphrase',512); SET @init_vector = RANDOM_BYTES(16); insert into t (c1,c2, rnd_pad) values (1, AES_ENCRYPT('Secret', @key_str, @init_vector), @init_vector); -- decrypt data select c1, AES_DECRYPT(c2,@key_str, rnd_pad) from t;

  • GOOD: Master and slave servers have exactly the same data and no problems with replication.
  • GOOD: Even if two different end-users have exactly the same password, the stored values are different due to random bytes in the init vector for AES encryption.
  • GOOD: Both the encryption and random number generation uses an external library (openssl).
  • CONF: It’s important to have binlog_format=ROW to avoid sending the secret to slave servers.
  • CONF: Do not allow end-users to change data without changing the init_vector, especially for small strings without random padding. Each update should cause init_vector re-generation.
  • BAD: Encrypted data is still sent to slave servers. If the encryption algorithm or protocol is broken, it is possible to get access to data from an insecure part of the application.
  • BAD: The described protocol still could be insecure.
Replication filters

There are two types of replication filters: a master-side with binlog-*db and a slave-side with replicate-*.

Both could cause replication breakage. Replication filters were created for STATEMENT-based replication and are problematic with modern binlog_format=ROW + gtid_mode=on setup. You can find several cases related to database-level slave-side filters in this blog post. If you still need slave-side filtering, use per-table replicate-wild-*-table options.


Even if binary logging is disabled for a specific database, the statement still could be stored in the binary log if it’s a DDL statement, or if the binlog_format is STATEMENT or MIXED and default database is not used by the statement. For details, see the reference manual for the binlog-do-db option. In order to avoid replication issues, you should use ROW-based replication and run SET SESSION sql_log_bin=0; before each DDL statement is executed against the ignored database. It’s not a good idea to use binlog-do-db, because you are losing control of what should be replicated.

Why is binary log filtering useful? Changing the sql_log_bin variable is prohibited inside transactions. The sql_log_bin is DANGEROUS, please do not use it instead of binlog-ignore-db in production on the application side. If you need it for database administration, make sure that you are always typing the “session” word before sql_log_bin. This makes problematic consistent updates of multiple entities inside database.

We still should have the ability to hide just one column from the table. But if we are ignoring the database, we should provide a method of reading non-secure data on slaves / by restricted MySQL accounts. This is possible with triggers and views:

create database test; set session sql_log_bin=0; create table test.t(c1 int, c2 int, primary key(c1)); alter table test.t add primary key(c1); set session sql_log_bin=1; create database test_insecure; create table test_insecure.t(c1 int, c2 int default NULL, primary key(c1)); use test delimiter // create trigger t_aft_ins after insert on test.t FOR EACH ROW BEGIN INSERT test_insecure.t (c1) values (NEW.c1); END // create trigger t_aft_upd after update on test.t FOR EACH ROW BEGIN UPDATE test_insecure.t SET c1 = NEW.c1 WHERE c1 = OLD.c1; END // create trigger t_aft_del after delete on test.t FOR EACH ROW BEGIN DELETE FROM test_insecure.t WHERE c1 = OLD.c1; END // delimiter ; -- just on slave: create database test; create view test.t as select * from test_insecure.t; -- typical usage INSERT INTO test.t values(1,1234); SELECT * from test.t; -- works on both master and slave, c2 field will have NULL value on slave.

  • BAD: The data is not the same on the master and slaves. It potentially breaks replication. It’s not possible to use a slave’s backup to restore the master or promote the slave as a new master.
  • BAD: Triggers could reduce DML statement performance.
  • GOOD: The sensitive data is not sent to slaves at all (and not written to binary log).
  • GOOD: It works with GTID
  • GOOD: It requires no application changes (or almost no application changes).
  • GOOD: binlog-ignore-db allows us to not use the dangerous sql_log_bin variable after initial table creation.

The post Keep Sensitive Data Secure in a Replication Setup appeared first on Percona Database Performance Blog.

This Week In Data with Colin Charles 37: Percona Live 2018 Wrap Up

April 27, 2018 - 3:17pm

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Percona Live Santa Clara 2018 is now over! All things considered, I think it went off quite well; if you have any comments/complaints/etc., please don’t hesitate to drop me a line. I believe a survey will be going out as to where you’d like to see the conference in 2019 – yes, it is no longer going to be at the Santa Clara Convention Centre.

I was pleasantly surprised that several people came up to me saying they read this column and enjoy it. Thank you!

The whole conference was abuzz with MySQL 8.0 GA chatter. Many seemed to enjoy the PostgreSQL focus too, now that Percona announced PostgreSQL support.

Congratulations as well to the MySQL Community Awards 2018 winners.

Releases Link List Upcoming appearances Feedback

I look forward to feedback/tips via e-mail at or on Twitter @bytebot.

The post This Week In Data with Colin Charles 37: Percona Live 2018 Wrap Up appeared first on Percona Database Performance Blog.

Visit Percona Store

General Inquiries

For general inquiries, please send us your question and someone will contact you.