Update on the InnoDB double-write buffer and EXT4 transactions

Latest MySQL Performance Blog posts - June 17, 2015 - 7:15am

In a post written a few months ago, I found that using EXT4 transactions with the “data=journal” mount option improved write performance significantly, by 55%, without putting data at risk. Many people commented on the post saying they were not able to reproduce the results, so I decided to investigate further in order to find out why my results were different.

So, I ran sysbench benchmarks on a few servers and found when the InnoDB double-write buffer limitations occur and when they don’t. I also made sure some of my colleagues were able to reproduce the results. Basically, in order to reproduce the results you need the following conditions:

  • Spinning disk (no SSD)
  • Enough CPU power
  • A dataset that fits in the InnoDB buffer pool
  • A continuous high write load with many ops waiting for disk

Using the InnoDB double-write buffer on an SSD disk largely prevents us from seeing the issue, which is good performance-wise. That comes from the fact that the latency of each write operation is much lower. It makes sense: the double-write buffer is an area of 128 pages on disk that is used by the write threads. When a write thread needs to write a bunch of dirty pages to disk, it first writes them sequentially to free slots in the double-write buffer in a single iop, and then it spends time writing the pages to their actual locations on disk, typically using one iop per page. Once done writing, it releases the double-write buffer slots it was holding and another thread can do its work. The presence of a RAID controller with a write cache certainly helps, at least until the write cache is full. Since I didn’t test with a RAID controller, I suspect its write cache will delay the onset of the symptoms, but if the write load is sustained over a long period of time, the issue with the InnoDB double-write buffer will appear.

So, to recapitulate: on a spinning disk, a write thread needs to hold a lock on some of the double-write buffer slots for at least a few milliseconds per page it needs to write, while on an SSD disk the slots are released very quickly because of the low latency of the SSD storage. To actually stress the InnoDB double-write buffer on an SSD disk, one must push many more writes.
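The slot-holding argument above can be put into a rough back-of-envelope model. The latency figures below are illustrative assumptions, not measurements:

```python
# Rough model of how long a write thread holds double-write buffer slots
# for one batch of dirty pages. Latency figures are illustrative guesses:
# ~8 ms per random write on a 7200 rpm spinning disk, ~0.1 ms on an SSD.

def slot_hold_time_ms(pages, per_page_write_ms, doublewrite_seq_ms):
    """One sequential iop into the double-write area, then one iop per page."""
    return doublewrite_seq_ms + pages * per_page_write_ms

batch = 64  # half of the 128-page double-write area

spinning = slot_hold_time_ms(batch, per_page_write_ms=8.0, doublewrite_seq_ms=8.0)
ssd      = slot_hold_time_ms(batch, per_page_write_ms=0.1, doublewrite_seq_ms=0.1)

print(f"spinning disk: slots held ~{spinning:.0f} ms per batch")
print(f"SSD:           slots held ~{ssd:.1f} ms per batch")
# The slots are held roughly 80x longer on the spinning disk, which is why
# the double-write buffer becomes a serialization point there first.
```

With these numbers the slots are held about 520 ms versus about 6.5 ms per batch, which is why other write threads pile up behind the double-write buffer on spinning media long before they would on flash.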

That leads us to the second point, the amount of CPU resources available. At first, one of my colleagues tried to reproduce the results on a small EC2 instance and failed. It appeared that by default, the sysbench oltp.lua script does quite a lot of reads, and those reads saturate the CPU, throttling the writes. By lowering the number of reads in the script, he was then able to reproduce the results.

For my benchmarks, I used the following command:

sysbench --num-threads=16 --mysql-socket=/var/lib/mysql/mysql.sock \
--mysql-database=sbtest --mysql-user=root \
--test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=50000000 \
--oltp-test-mode=complex --mysql-engine=innodb --db-driver=mysql \
--report-interval=60 --max-requests=0 --max-time=3600 run

Both servers used were metal boxes with 12 physical cores (24 HT). With less CPU power available, I suggest lowering the read-related options of the oltp.lua script, so that the CPU is not wasted on reads and enough writes are generated. Remember we are not doing a generic benchmark; we are just stressing the InnoDB double-write buffer.
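For reference, the default oltp.lua transaction is read-heavy. The per-transaction statement counts below, and the matching --oltp-* option names, are the commonly cited sysbench 0.5 defaults; treat them as assumptions to verify against your sysbench build:

```python
# Approximate default statement mix of sysbench 0.5's oltp.lua, per
# transaction. Counts and option names are assumed defaults, not taken
# from the post above.
READ_OPS = {
    "point_selects": 10,   # --oltp-point-selects
    "simple_ranges": 1,    # --oltp-simple-ranges
    "sum_ranges": 1,       # --oltp-sum-ranges
    "order_ranges": 1,     # --oltp-order-ranges
    "distinct_ranges": 1,  # --oltp-distinct-ranges
}
WRITE_OPS = {
    "index_updates": 1,
    "non_index_updates": 1,
    "deletes": 1,
    "inserts": 1,
}

reads = sum(READ_OPS.values())
writes = sum(WRITE_OPS.values())
print(f"default mix: {reads} reads vs {writes} writes per transaction")
# Setting the five read options above to 0 leaves a write-only workload,
# so a weak CPU is no longer saturated by reads before the disk is.
```

With the defaults, reads outnumber writes roughly 14 to 4 per transaction, which is why a CPU-bound box throttles the write load before the double-write buffer is ever stressed.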

In order to make sure something else isn’t involved, I verified the following:

  • Server independence: tried on 2 physical servers and one EC2 instance, CentOS 6 and Ubuntu 14.04
  • MySQL distribution: tried MySQL Community and Percona Server
  • MySQL version: tried 5.5.37 and 5.6.23 (Percona Server)
  • Varied the InnoDB log file size from 32MB to 512MB
  • The impact of the number of InnoDB write threads (1, 2, 4, 8, 16, 32)
  • The use of Linux native asynchronous I/O
  • Spinning and SSD storage

So, with all those verifications done, I can maintain that if you are using a server with spinning disks and a high write load, using EXT4 transactions instead of the InnoDB double-write buffer yields an increase in throughput of more than 50%. In an upcoming post, I’ll show how the performance stability is affected by the InnoDB double-write buffer under a high write load.

Appendix: the relevant part of the my.cnf

innodb_buffer_pool_size = 12G
innodb_write_io_threads = 8 # or else in {1,2,4,8,16,32}
innodb_read_io_threads = 8
innodb_flush_log_at_trx_commit = 0 # must be 0 or 2 to really stress the double write buffer
innodb_log_file_size = 512M # or 32M, 64M
innodb_log_files_in_group = 2
innodb_flush_method=O_DIRECT # or O_DSYNC
innodb_buffer_pool_restore_at_startup=300 # On 5.5.x, important to warm up the buffer pool
#innodb_buffer_pool_load_at_startup=ON # on 5.6, important to warm up the buffer pool
#innodb_buffer_pool_dump_at_shutdown=ON # on 5.6, important to warm up the buffer pool,
skip-innodb_doublewrite # or commented out
innodb_flush_neighbor_pages=none # or area for spinning

The post Update on the InnoDB double-write buffer and EXT4 transactions appeared first on MySQL Performance Blog.

Performance inputs for PXC

Latest Forum Posts - June 17, 2015 - 5:41am
Please suggest some inputs on read/write performance on PXC.

H/W -- 3 nodes with 1 HA Proxy

12 Core CPU/DELL
Flash Drive 2.1 Tb
PXC -- 5.6.24 PXC (GPL)

Sample benchmarking:
# mysqlslap --concurrency=40 --auto-generate-sql-load-type=read --host= --port=3307 --user=****** --password=********* --auto-generate-sql --number-of-queries=100000 --number-char-cols=9 --number-int-cols=13 --auto-generate-sql-secondary-indexes=10 --engine=innodb --verbose
Warning: Using a password on the command line interface can be insecure.
Running for engine innodb
Average number of seconds to run all queries: 19.978 seconds
Minimum number of seconds to run all queries: 19.978 seconds
Maximum number of seconds to run all queries: 19.978 seconds
Number of clients running queries: 40
Average number of queries per client: 2500
# mysqlslap --concurrency=140 --auto-generate-sql-load-type=write --host= --port=3307 --user=***** --password=****** --auto-generate-sql --number-of-queries=100000 --number-char-cols=9 --number-int-cols=13 --auto-generate-sql-secondary-indexes=10 --engine=innodb --verbose
Warning: Using a password on the command line interface can be insecure.
Running for engine innodb
Average number of seconds to run all queries: 15.062 seconds
Minimum number of seconds to run all queries: 15.062 seconds
Maximum number of seconds to run all queries: 15.062 seconds
Number of clients running queries: 140
Average number of queries per client: 714

Sample config on PXC:

# Path to Galera library
# Cluster connection URL contains the IPs of node#1, node#2 and node#3
# In order for Galera to work correctly binlog format should be ROW
# MyISAM storage engine has only experimental support
# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
# Node #1 address
# SST method
# Cluster name
# Authentication for SST method


wsrep_provider_options="gcache.size=128M;gcache.page_size=128M;"

Speed up GROUP BY queries with subselects in MySQL

Latest MySQL Performance Blog posts - June 15, 2015 - 11:32am

We usually try to avoid subselects because they sometimes force the use of a temporary table and limit the use of indexes. But when is it good to use a subselect?

This example was tested on table a (1310723 rows) and tables b, c and d (5 rows each), with MySQL versions 5.5 and 5.6.

Let’s suppose we have a query like this:

select a.name,sum(a.count) aSum,avg(a.position) aAVG,b.col1,c.col2,d.col3
from a
join b on (a.bid = b.id)
join c on (a.cid = c.id)
join d on (a.did = d.id)
group by a.name,b.id,c.id,d.id

What will MySQL do? First it will take the entire data set – this means it will go through each row, scanning the values of “bid,” “cid” and “did,” and then apply the join to each table. At that point it has the complete data set, and then it will start to group it, executing the sum and average functions.

Let’s analyze it step by step:

  1. Scan each row of table a, which has 1310720 rows.
  2. Join each row of table a with b, c and d – this means that each of the 1310720 rows will be joined, making the temporary table bigger.
  3. Execute the group by, which will scan the 1310720 rows again and create the result data set.

What can we do to optimize this query? We can’t avoid the “group by” over the 1.3M rows, but we are able to avoid the join over 1.3M rows. How? We need all of the information from table a for the “group by,” but we don’t need to execute all the joins before grouping. Let’s rewrite the query:

select a.name,aSum,aAVG,b.col1,c.col2,d.col3
from ( select name,sum(count) aSum,avg(position) aAVG,bid,cid,did
       from a
       group by name,bid,cid,did ) a
join b on (a.bid = b.id)
join c on (a.cid = c.id)
join d on (a.did = d.id)

We see from the above query that we are now doing the “group by” only over table a; the result data set of that subquery is just 20 rows. But what about the query response time? The first query took 2.3 sec on average and the optimized query took 1.8 sec on average, half a second faster.
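The rewrite can be sanity-checked on a toy data set. This sketch uses SQLite (Python’s stdlib sqlite3) as a stand-in for MySQL, and renames the count column to cnt to avoid the SQL function name; with primary-key joins, grouping before or after the joins yields identical rows:

```python
import sqlite3

# Toy schema mirroring the article's tables; "count" is renamed "cnt".
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE a (name TEXT, cnt INT, position INT, bid INT, cid INT, did INT);
CREATE TABLE b (id INT PRIMARY KEY, col1 TEXT);
CREATE TABLE c (id INT PRIMARY KEY, col2 TEXT);
CREATE TABLE d (id INT PRIMARY KEY, col3 TEXT);
INSERT INTO b VALUES (1,'b1'),(2,'b2');
INSERT INTO c VALUES (1,'c1'),(2,'c2');
INSERT INTO d VALUES (1,'d1'),(2,'d2');
INSERT INTO a VALUES ('x',1,10,1,1,1),('x',3,20,1,1,1),
                     ('y',5,30,2,2,2),('y',7,40,2,1,1);
""")

original = """
SELECT a.name, SUM(a.cnt) aSum, AVG(a.position) aAVG, b.col1, c.col2, d.col3
FROM a JOIN b ON a.bid = b.id JOIN c ON a.cid = c.id JOIN d ON a.did = d.id
GROUP BY a.name, b.id, c.id, d.id
ORDER BY a.name, b.id, c.id, d.id
"""

rewritten = """
SELECT a.name, aSum, aAVG, b.col1, c.col2, d.col3
FROM (SELECT name, SUM(cnt) aSum, AVG(position) aAVG, bid, cid, did
      FROM a GROUP BY name, bid, cid, did) a
JOIN b ON a.bid = b.id JOIN c ON a.cid = c.id JOIN d ON a.did = d.id
ORDER BY a.name, b.id, c.id, d.id
"""

r1 = cur.execute(original).fetchall()
r2 = cur.execute(rewritten).fetchall()
print(r1)
assert r1 == r2  # grouping first, then joining, yields identical rows
```

The equality holds because b, c and d are joined on their primary keys, so each (bid, cid, did) combination maps to exactly one joined row, before or after the aggregation.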

What about adding a covering index? The index that we can add will be:

alter table a add index (name,bid,cid,did,count,position);

The explain plan of both queries shows that it is using just the index to resolve the query.

Now the response time of the original query is 1.9 sec, which is close to the time of the optimized query. However, the response time of the optimized query is now 0.7 sec, nearly 3x faster. The downside of adding this index is that we are indexing the whole table: the index length is near 80% of the data length.
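The covering-index effect can be illustrated with SQLite (stdlib sqlite3) as a stand-in for MySQL; the table mirrors table a with count renamed to cnt. This shows the concept, not MySQL’s optimizer:

```python
import sqlite3

# Index every column the aggregation touches, so the query can be
# answered from the index alone (a "covering index").
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE a (name TEXT, cnt INT, position INT, "
            "bid INT, cid INT, did INT)")
cur.execute("CREATE INDEX cover ON a (name, bid, cid, did, cnt, position)")

plan = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name, SUM(cnt) aSum, AVG(position) aAVG "
    "FROM a GROUP BY name, bid, cid, did"
).fetchall()
for row in plan:
    print(row)
# The plan scans the covering index instead of the table, the analogue of
# MySQL's EXPLAIN showing "Using index".
```

Because the grouping columns form the index prefix, the index also delivers the rows in group order, so no extra sort pass is needed.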

If the original query had “where” conditions, the rewrite depends on which fields they touch. Let’s suppose we add c.col2=3:

select a.name,sum(a.count) aSum,avg(a.position) aAVG,b.col1,c.col2,d.col3
from a
join b on (a.bid = b.id)
join c on (a.cid = c.id)
join d on (a.did = d.id)
where c.col2=3
group by a.name,b.id,c.id,d.id

Now, in the new query, the subquery changes: table c and the “where” clause must be moved into the subquery:

select a.name,aSum,aAVG,b.col1,a.col2,d.col3
from ( select a.name,sum(count) aSum,avg(position) aAVG,bid,cid,did,c.col2
       from a
       join c on (a.cid = c.id)
       where c.col2=3
       group by name,bid,cid,did ) a
join b on (a.bid = b.id)
join d on (a.did = d.id)

But the difference in times is not as big (original query 1.1 sec, new query 0.9 sec). Why? Because the original query has less data to group by. Adding c.col2=3 to the original query reduces the amount of data to group by from 1.3M to 262K rows. Indeed, the more “where” conditions you add on different tables, the smaller the data set to sort and the smaller the speed-up.

Conclusion: We usually add the GROUP BY at the end of queries, and that is OK because the syntax forces us to do it. However, we can use a subquery to group only the data we need and then perform the joins over the other tables. This can speed up some of our GROUP BY queries.

The post Speed up GROUP BY queries with subselects in MySQL appeared first on MySQL Performance Blog.

Cannot add some Google account linked users to Percona Cloud organization

Latest Forum Posts - June 15, 2015 - 10:41am
Error: "A user does not exist for (email)"

I have a user who signed up with their Google account that I was able to add, but 2 other users who signed up with their Google accounts that I can't seem to add. Any suggestions?


Incremental streaming backups and restore

Latest Forum Posts - June 15, 2015 - 9:31am

I'm new to Percona and have just been getting my head around the features it has. My ultimate goal is to be able to stream incremental backups to a remote server and to restore/rebuild the database on the remote host.

So far I've got to the point of streaming the increments according to this guide (at the very end):
# innobackupex --incremental --incremental-lsn=LSN-number --stream=xbstream ./ | ssh user@hostname "cat - | xbstream -x -C /backup-dir/"

My questions are:
1. How can I send the incremental backups without the need to refer to the LSN-number?
2. How can I rebuild the increments or rather prepare the backups on the remote host ?

Suggestions are most welcome!

My cluster crash randomly

Latest Forum Posts - June 15, 2015 - 12:41am
Hi, I run a cluster with 3 nodes. It was created and ran normally for about 1 day before 1 node crashed randomly.

My information
1. Server version: 5.6.21-70.1-56-log Percona XtraDB Cluster (GPL), Release rel70.1, Revision 938, WSREP version 25.8, wsrep_25.8.r4150

2. Ram 32G, CPU 24 core, 4 HDD - raid 10

3. datafile: 4G

4. file: /etc/my.conf

# Disabling symbolic-links is recommended to prevent assorted security risks

# Path to Galera library

wsrep_provider_options="gcache.size = 1G; gcache.page_size = 512M; gcs.fc_limit = 512"

# Cluster connection URL contains IPs of node#1, node#2 and node#3
# In order for Galera to work correctly binlog format should be ROW
# MyISAM storage engine has only experimental support
# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
# Node #2 address
# Cluster name
# SST method
#Authentication for SST method

# Maximum number of rows in write set
# Maximum size of write set

#################### TUNNING ########################

###### Slow query log


slow_query_log_file =/var/log/mysql/slow_queries.log

##### Set Ramdisk #####

tmpdir = /usr/mysqltmp


5. Error message

05:05:00 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 6590372 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0xa09eba0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fa617fedd38 thread_stack 0x40000
/usr/sbin/mysqld(_ZN22Item_func_release_lock7val_intEv+0x10f)[0x60068f]
/usr/sbin/mysqld(_ZN8Protocol19send_result_set_rowEP4ListI4ItemE+0xc7)[0x65ef47]
/usr/sbin/mysqld(_ZN11select_send9send_dataER4ListI4ItemE+0x67)[0x6ae287]
/usr/sbin/mysqld(_Z12mysql_selectP3THDP10TABLE_LISTjR4ListI4ItemEPS4_P10SQL_I_ListI8st_orderESB_S7_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x250)[0x7120e0]
/usr/sbin/mysqld(_Z13handle_selectP3THDP13select_resultm+0x187)[0x712967]
/usr/sbin/mysqld(_ZN18Prepared_statement7executeEP6Stringb+0x40e)[0x7002ae]
/usr/sbin/mysqld(_ZN18Prepared_statement12execute_loopEP6StringbPhS2_+0xde)[0x7044ae]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x658)[0x6f0338]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x19d5)[0x6f2675]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fa4dc004178): is an invalid pointer
Connection ID (thread ID): 6222895

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
150608 12:05:00 mysqld_safe Number of processes running now: 0
150608 12:05:00 mysqld_safe WSREP: not restarting wsrep node automatically
150608 12:05:00 mysqld_safe mysqld from pid file /var/lib/mysql/RR-Cluster-DB2.pid ended

I've dug through the internet for a month but nothing fixed my error. I hope someone here on the Percona forum can help me.
Thanks a million.

Third Node cannot connect / 2 Nodes in AWS / 1 node inhouse with NAT

Latest Forum Posts - June 12, 2015 - 10:53pm
Hi guys ..

I always read all posts and normally they solve my issues.
I would like to take the time to explain my situation. Please bear with me ..
All node using Ubuntu 12.04
AWS nodes using percona Server version: 5.5.34-31.1 Percona XtraDB Cluster (GPL), Release 31.1, wsrep_25.9.r3928
Both AWS have elastic public IP
In house node using Percona 5.5.41-25.11-853

AWS configured of course with the internal IPs 10.xxx.xxx.xxx
and In house its a through NAT on a public IP 199.xxx.xxx.xxx

Both AWS nodes have in gcomm the public IP of the in-house server, and the in-house server has the 2 elastic IPs and the internal ones.

And here is the dump of the

tail -f /var/lib/mysql/novo.err

150612 23:25:52 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150612 23:25:52 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.UoqEvJ' --pid-file='/var/lib/mysql/noomedici-recover.pid'
150612 23:25:55 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
150612 23:25:55 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
150612 23:25:55 [Note] WSREP: Read nil XID from storage engines, skipping position init
150612 23:25:55 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
150612 23:25:55 [Note] WSREP: wsrep_load(): Galera 2.12(r318911d) by Codership Oy <info@codership.com> loaded successfully.
150612 23:25:55 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
150612 23:25:55 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
150612 23:25:55 [Note] WSREP: Passing config to GCS: base_host =; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 923372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; pr
150612 23:25:55 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
150612 23:25:55 [Note] WSREP: wsrep_sst_grab()
150612 23:25:55 [Note] WSREP: Start replication
150612 23:25:55 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
150612 23:25:55 [Note] WSREP: protonet asio version 0
150612 23:25:55 [Note] WSREP: backend: asio
150612 23:25:55 [Note] WSREP: GMCast version 0
150612 23:25:55 [Note] WSREP: (aa293f06, 'tcp://') listening at tcp://
150612 23:25:55 [Note] WSREP: (aa293f06, 'tcp://') multicast: , ttl: 1
150612 23:25:55 [Note] WSREP: EVS version 0
150612 23:25:55 [Note] WSREP: PC version 0
150612 23:25:55 [Note] WSREP: gcomm: connecting to group 'Percona-XtraDB-Cluster', peer '107.xx.xxx.xx:,54.xxx.xx.xxx:,'
150612 23:25:55 [Warning] WSREP: (aa293f06, 'tcp://') address 'tcp://' points to own listening address, blacklisting
150612 23:25:55 [Note] WSREP: (aa293f06, 'tcp://') turning message relay requesting on, nonlive peers: tcp://
150612 23:25:55 [Note] WSREP: declaring 919c045d at tcp:// stable
150612 23:25:55 [Note] WSREP: declaring af980804 at tcp:// stable
150612 23:25:56 [Note] WSREP: Node 919c045d state prim
150612 23:25:57 [Note] WSREP: gcomm: connected
150612 23:25:57 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
150612 23:25:57 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
150612 23:25:57 [Note] WSREP: Opened channel 'Percona-XtraDB-Cluster'
150612 23:25:57 [Note] WSREP: Waiting for SST to complete.
150612 23:25:57 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
150612 23:25:57 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
150612 23:25:57 [Note] WSREP: STATE EXCHANGE: sent state msg: ab12c5d2-118c-11e5-a9f3-cf551b1e5734
150612 23:25:57 [Note] WSREP: STATE EXCHANGE: got state msg: ab12c5d2-118c-11e5-a9f3-cf551b1e5734 from 0 (db2)
150612 23:25:57 [Note] WSREP: STATE EXCHANGE: got state msg: ab12c5d2-118c-11e5-a9f3-cf551b1e5734 from 2 (db1)
150612 23:25:57 [Note] WSREP: STATE EXCHANGE: got state msg: ab12c5d2-118c-11e5-a9f3-cf551b1e5734 from 1 (db3)
150612 23:25:57 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 2,
members = 2/3 (joined/total),
act_id = 35909744,
last_appl. = -1,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = d3538475-b069-11e3-92aa-c323b5f792a4
150612 23:25:57 [Note] WSREP: Flow-control interval: [28, 28]
150612 23:25:57 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 35909744)
150612 23:25:57 [Note] WSREP: State transfer required:
Group state: d3538475-b069-11e3-92aa-c323b5f792a4:35909744
Local state: 00000000-0000-0000-0000-000000000000:-1
150612 23:25:57 [Note] WSREP: New cluster view: global state: d3538475-b069-11e3-92aa-c323b5f792a4:35909744, view# 3: Primary, number of nodes: 3,my index: 1, protocol version 2
150612 23:25:57 [Note] WSREP: closing client connections for protocol change 3 -> 2
150612 23:25:58 [Note] WSREP: (aa293f06, 'tcp://') turning message relay requesting off
150612 23:25:59 [Warning] WSREP: Gap in state sequence. Need state transfer.
150612 23:25:59 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '' --auth 'sstuser:zASDF@7890@yhn' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '25936''
WSREP_SST: [INFO] Streaming with tar (20150612 23:25:59.618)
WSREP_SST: [INFO] Using socat as streamer (20150612 23:25:59.624)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20150612 23:25:59.636)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | tar xfi - --recursive-unlink -h; RC=( ${PIPESTATUS[@]} ) (20150612 23:25:5.664)
150612 23:25:59 [Note] WSREP: Prepared SST request: xtrabackup|
150612 23:25:59 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150612 23:25:59 [Note] WSREP: Assign initial position for certification: 35909744, protocol version: 2
150612 23:25:59 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (d3538475-b069-11e3-92aa-c323b5f792a4): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():447. IST will be unavailable.
150612 23:25:59 [Note] WSREP: Node 1 (db3) requested state transfer from '*any*'. Selected 0 (db2)(SYNCED) as donor.
150612 23:25:59 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 35909744)
150612 23:25:59 [Note] WSREP: Requesting state transfer: success, donor: 0

150612 23:27:04 [Warning] WSREP: 0 (db2): State transfer to 1 (db3) failed: -1 (Operation not permitted)
150612 23:27:04 [ERROR] WSREP: gcs/src/gcs_group.cpp:long int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():717: Will never receive state. Need to abort.
150612 23:27:04 [Note] WSREP: gcomm: terminating thread
150612 23:27:04 [Note] WSREP: gcomm: joining thread
150612 23:27:04 [Note] WSREP: gcomm: closing backend
150612 23:27:04 [Note] WSREP: gcomm: closed
150612 23:27:04 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted (core dumped)
150612 23:27:04 mysqld_safe mysqld from pid file /var/lib/mysql/novomedici.pid ended

FYI Galera used on the AWS servers is ..

| wsrep_provider_version | 2.8(r165) |

and on the in house server as you can see above Galera 2.12(r318911d)

Could this be the problem?

Thank you everyone, and sorry if I posted in the wrong place. Hope not!


pt-query-digest and prepare statements

Latest Forum Posts - June 12, 2015 - 9:11pm
I am using pt-query-digest from Percona Toolkit and I found that the tool is not examining the prepare or execute statements from the MySQL query log. Could you please tell me if it is possible to analyze prepare statements from the log at all?

Percona XtraDB Cluster: Quorum and Availability of the cluster

Latest MySQL Performance Blog posts - June 12, 2015 - 6:44am

Percona XtraDB Cluster (PXC) has become a popular option to provide high availability for MySQL servers. However many people are still having a hard time understanding what will happen to the cluster when one or several nodes leave the cluster (gracefully or ungracefully). This is what we will clarify in this post.

Nodes leaving gracefully

Let’s assume we have a 3-node cluster and all nodes have an equal weight, which is the default.

What happens if Node1 is gracefully stopped (service mysql stop)? When shutting down, Node1 will instruct the other nodes that it is leaving the cluster. We now have a 2-node cluster and the remaining members have 2/2 = 100% of the votes. The cluster keeps running normally.

What happens now if Node2 is gracefully stopped? Same thing, Node3 knows that Node2 is no longer part of the cluster. Node3 then has 1/1 = 100% of the votes and the 1-node cluster can keep on running.

In these scenarios, there is no need for a quorum vote as the remaining node(s) always know what happened to the nodes that are leaving the cluster.

Nodes becoming unreachable

On the same 3-node cluster with all 3 nodes running, what happens now if Node1 crashes?

This time Node2 and Node3 must run a quorum vote to estimate whether it is safe to continue: they have 2/3 of the votes, and 2/3 > 50%, so the remaining 2 nodes have quorum and keep on working normally.

Note that the quorum vote does not happen immediately when Node2 and Node3 lose contact with Node1. It only happens after the ‘suspect timeout’ (evs.suspect_timeout), which is 5 seconds by default. Why? It allows the cluster to be resilient to short network failures, which can be quite useful when operating the cluster over a WAN. The tradeoff is that if a node crashes, writes are stalled for the duration of the suspect timeout.

Now what happens if Node2 also crashes?

Again a quorum vote must be performed. This time Node3 has only 1/2 of the votes: that is not > 50% of the votes. Node3 doesn’t have quorum, so it stops processing reads and writes.
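The vote arithmetic in these scenarios is simple enough to sketch. This is a simplification that assumes all nodes carry the default equal weight of 1:

```python
# Minimal sketch of the quorum arithmetic described above.
def has_quorum(reachable_votes, total_votes):
    """A partition may stay PRIMARY only with a strict majority (> 50%)
    of the last known cluster membership."""
    return reachable_votes / total_votes > 0.5

# Graceful stops shrink the membership, so the survivors keep 100%:
assert has_quorum(2, 2)  # Node1 stopped with "service mysql stop"
assert has_quorum(1, 1)  # Node2 stopped gracefully as well

# Crashes leave the membership unchanged until the vote:
assert has_quorum(2, 3)      # Node1 crashed: 2/3 > 50%, cluster continues
assert not has_quorum(1, 2)  # Node2 crashed too: 1/2 is not > 50%, NON_PRIMARY
print("quorum rules check out")
```

Note the strict inequality: an exact 50/50 split (for example, a 2-node cluster losing one node) never keeps quorum, which is one reason odd node counts are recommended.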

If you look at the wsrep_cluster_status status variable on the remaining node, it will show NON_PRIMARY. This indicates that the node is not part of the Primary Component.

Why does the remaining node stop processing queries?

This is a question I often hear: after all, MySQL is up and running on Node3 so why is it prevented from running any query? The point is that Node3 has no way to know what happened to Node2:

  • Did it crash? In this case, it is safe for the remaining node to keep on running queries.
  • Or is there a network partition between the two nodes? In this case, it is dangerous to process queries because Node2 might also process other queries that will not be replicated because of the broken network link: the result will be two divergent datasets. This is a split-brain situation, and it is a serious issue as it may be impossible to later merge the two datasets. For instance if the same row has been changed in both nodes, which row has the correct value?

Quorum votes are not held because it’s fun, but only because the remaining nodes have to talk together to see if they can safely proceed. And remember that one of the goals of Galera is to provide strong data consistency, so any time the cluster does not know whether it is safe to proceed, it takes a conservative approach and it stops processing queries.

In such a scenario, the status of Node3 will be set to NON_PRIMARY and a manual intervention is needed to re-bootstrap the cluster from this node by running:

SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';

An aside question is: now it is clear why writes should be forbidden in this scenario, but what about reads? Couldn’t we allow them?

Actually this is possible from PXC 5.6.24-25.11 with the wsrep_dirty_reads setting.


Split-brain is one of the worst enemies of a Galera cluster. Quorum votes will take place every time one or several nodes suddenly become unreachable and are meant to protect data consistency. The tradeoff is that it can hurt availability, because in some situations a manual intervention is necessary to instruct the remaining nodes that they can accept executing queries.

The post Percona XtraDB Cluster: Quorum and Availability of the cluster appeared first on MySQL Performance Blog.

pt-upgrade Out of Memory

Latest Forum Posts - June 11, 2015 - 2:19pm
Hi all,

I'm running a test using Percona Toolkit's pt-upgrade. When I start the test the reference log is 20 megabytes in size, but the tests never finish OK. Ubuntu Server always kills the test because perl has taken 15 gigabytes of memory.

I added a swap file of 30 gigabytes, but the test broke again with the same error: perl consumed 94% of memory.

I don't know if it is necessary to configure some Linux parameter or MariaDB parameter.

MariaDB version is 5.5 (5.5.42-MariaDB-1~precise-log).

Thanks in advance for your answer.

can't figure out ssh templates configuration

Latest Forum Posts - June 11, 2015 - 9:02am
Hello all, I am trying to use the nginx templates (I have used other Cacti templates from Percona before, like the MySQL ones).

Do I have to do anything special if I am simply monitoring the local machine? It seems I can't even figure that out.

About Cmake error

Latest Forum Posts - June 11, 2015 - 3:25am
I'm trying to install Percona XtraBackup.

I have followed this KB https://www.percona.com/doc/percona-...ing-with-cmake; it's a clean CentOS minimal 6.6.

After running "yum install" with all the packages, I sent this command:

cmake -DBUILD_CONFIG=xtrabackup_release && make -j4

The output is

CMake Error: The source directory "/root" does not appear to contain CMakeLists.txt.

I've located my installation's CMakeLists.txt, and the output is

[root@backvm ~]# locate CMakeLists.txt
/usr/share/doc/cmake-

Now I've run my command in /usr/share/cmake/Modules/FortranCInterface/ and /usr/share/cmake/Modules/IntelVSImplicitPath/ and the output is

[root@backvm IntelVSImplicitPath]# cmake -DBUILD_CONFIG=xtrabackup_release && make -j4
-- The Fortran compiler identification is unknown
CMake Error: your Fortran compiler: "CMAKE_Fortran_COMPILER-NOTFOUND" was not found. Please set CMAKE_Fortran_COMPILER to a valid compiler path or name.
-- Configuring incomplete, errors occurred!
See also "/usr/share/cmake/Modules/IntelVSImplicitPath/CMakeFiles/CMakeOutput.log".
See also "/usr/share/cmake/Modules/IntelVSImplicitPath/CMakeFiles/CMakeError.log".

How can I solve this issue?

Percona Live Europe 2015! Call for speakers; registration now open

Latest MySQL Performance Blog posts - June 10, 2015 - 12:00am

Percona Live is moving from London to Amsterdam this year and the event is also expanding to three full days. Percona Live Europe 2015, September 21-23, will be at the Mövenpick Hotel Amsterdam City Centre. The call for speakers and Super Saver registration are now open. Hurry though because the deadline for submitting a speaking proposal is June 21st and Super Saver registration ends July 5th!

This year’s conference will feature one day of tutorials and two days of keynote talks and breakout sessions related to MySQL, NoSQL and Data in the Cloud. You’ll get briefed on the hottest topics, learn about operating a high-performing deployment and hear from top-industry leaders describe the future of the ecosystem – encompassing MySQL, MariaDB, Percona Server, MongoDB (and more). Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

Have something to say? Why not lead a breakout session or a tutorial?

Breakout sessions are 50 minutes including a Q&A. Tutorial sessions focus on an immediate and practical application of in-depth MySQL and NoSQL knowledge. Tutorial speakers should assume that attendees will have laptops to work through detailed and potentially hands-on presentations. Tutorials are typically three hours long including a Q&A, however, if you have content for a full day, submissions for 6-hour-long tutorials are also being accepted. If your tutorial or breakout session is approved, you’ll receive a complimentary full-conference pass.

Huge thanks to our Conference Committee!

There would be no Percona Live without the hard work of our conference committees. Meet this year’s Percona Live Europe 2015 Conference Committee – a dedicated group of experts in MySQL, NoSQL and Data in the Cloud:

  • Erik Beebe – Founder and CTO, ObjectRocket
  • Luis Motta Campos – Database Administrator, eBay Classifieds Group
  • Colin Charles – Chief Evangelist, MariaDB
  • César Trigo Esteban – Development Director, Gigigo
  • Kenny Gorman – Chief Technologist, Data; Office of the CTO, Rackspace
  • Amrith Kumar – Founder and CTO, Tesora
  • Giuseppe Maxia – Quality Assurance Architect, VMWare
  • Shlomi Noach – Senior Systems Engineer, Booking.com
  • Konstantin Osipov – Engineering Manager, Mail.Ru
  • Morgan Tocker – MySQL Community Manager, Oracle
  • Art van Scheppingen – Head of Database Engineering, Spil Games
  • Charity Majors – Production Engineering Manager, Facebook
  • Peter Zaitsev – Co-founder and CEO, Percona

Sponsorship opportunities for Percona Live Europe 2015 are now available. Sponsors become part of a dynamic and fast-growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend the event. This year’s conference will feature expanded accommodations and turnkey kiosks.

Planning to attend?

Super Saver registration discounts for Percona Live Europe 2015 are available through July 5th (at 11:30 p.m. CEST). Visit the Percona Live Europe 2015 website for more information about the conference. Interested community members can also register to receive email updates about Percona Live Europe 2015.

Percona has also negotiated a special hotel rate at the Mövenpick Hotel Amsterdam City Centre. If you book your hotel before July 6th your delicious breakfast is included.

I hope to see you in Amsterdam!

The post Percona Live Europe 2015! Call for speakers; registration now open appeared first on MySQL Performance Blog.

Does pt-table-checksum follow slaves in a 2-hop ("chained") replication?

Latest Forum Posts - June 9, 2015 - 9:46am
We've been using pt-table-checksum for several years to check replication consistency. Just recently we started getting errors for slaves that are 2 hops away from the master. I was under the impression that pt-table-checksum only checks a master and its IMMEDIATE (i.e. 1 hop) slaves. I looked through the documentation and googled for a while but couldn't find anything saying whether pt-table-checksum finds all slaves in a chained replication topology. Can someone clarify this issue? btw, we are using the "hosts" method of slave discovery in case that makes any difference.

Auditing MySQL with McAfee and MongoDB

Latest MySQL Performance Blog posts - June 9, 2015 - 6:00am

Greetings everyone! Let’s discuss a 3rd-party auditing solution for MySQL and how we can leverage MongoDB® to make sense of all of that data.

The McAfee MySQL Audit plugin does a great job of capturing, at low level, activities within a MySQL server. It does this through some non-standard APIs which is why installing and configuring the plugin can be a bit difficult. The audit information is stored in JSON format, in a text file, by default.

There is 1 JSON object for each action that takes place within MySQL. If a user logs in, there’s an object. If that user queries a table, there’s an object. Imagine 1000 active connections from an application, each doing 2 queries per second. That’s 2000 JSON objects per second being written to the audit log. After 24 hours, that would be almost 173,000,000 audit entries!
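The arithmetic behind that estimate can be checked in a couple of lines:

```python
connections = 1000              # active application connections
queries_per_second_each = 2     # queries per second per connection
seconds_per_day = 24 * 60 * 60  # 86400

objects_per_second = connections * queries_per_second_each  # 2000 JSON objects/s
objects_per_day = objects_per_second * seconds_per_day
print(objects_per_day)  # 172800000, i.e. almost 173 million audit entries
```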

How does one make sense of that many JSON objects? One option would be to write your own parser in $YOUR_FAVORITE_LANGUAGE, convert the JSON to INSERT statements and write the data back to MySQL (Note: if you do this, you can whitelist this table within the plugin so that these INSERTs are not themselves audit logged). Or, we can use a system designed to import, store and query JSON objects, such as MongoDB.
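As a rough illustration of the do-it-yourself route, here is a minimal Python sketch that turns one JSON audit record into an INSERT statement. The field names match the sample document shown later in this post, but the target table name (audit) and the chosen columns are assumptions for illustration, not part of the plugin:

```python
import json

# Hypothetical target columns; adjust to the audit fields you care about.
COLUMNS = ("msg-type", "date", "user", "cmd", "query")

def json_line_to_insert(line, table="audit"):
    """Convert one JSON audit record into an INSERT statement."""
    record = json.loads(line)
    values = []
    for col in COLUMNS:
        val = str(record.get(col, ""))
        values.append("'" + val.replace("'", "''") + "'")  # naive SQL escaping
    cols = ", ".join("`%s`" % c for c in COLUMNS)
    return "INSERT INTO %s (%s) VALUES (%s);" % (table, cols, ", ".join(values))

sample = ('{"msg-type": "activity", "date": "1433438419388", '
          '"user": "appuser", "cmd": "select", "query": "SELECT 1"}')
print(json_line_to_insert(sample))
```

A real parser would also need to flatten the "objects" array and handle quoting more carefully; for production use a proper parameterized-query driver rather than string escaping.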

Install McAfee Audit Plugin

First we need to download the source code for the plugin and download the source code for the specific MySQL version you are running. This is not a complete step-by-step HOWTO on installing this plugin; just some high-level points.

My client for this exercise is still on Percona Server 5.1.73, so we need the source for that EXACT version from percona.com.

We can clone the mcafee/mysql-audit using git.

Unzip the MySQL source and compile it; just don’t do “make install”, only “./configure” and “make” are necessary.

Now compile the plugin. You may want to read the detailed instructions.

This next step is tricky and really only necessary if you are not using vanilla MySQL. It is a required step to allow the plugin to use those non-standard APIs I mentioned earlier. You need to extract the offsets for the plugin to work. Follow the instructions given.

Once that is all done, you can:

INSTALL PLUGIN AUDIT SONAME 'libaudit_plugin.so';

If the plugin fails to load, check MySQL’s error log for the reason why and consult the plugin documentation on how to resolve it.

We now need to enable audit logging because nothing is enabled by default.

SET GLOBAL audit_record_cmds = "select,insert,update,delete";
SET GLOBAL audit_json_file = ON;
SET GLOBAL audit_record_objs = "*.*,{}";
SET GLOBAL audit_force_record_logins = ON;

Look inside @@datadir and you should see a file called mysql-audit.json. You can tail -f this file if you’d like to watch it to make sure data is being written.

If you’d like some more background reading on the audit plugin, check out Fernando’s post on Experiences with McAfee Audit Plugin.

Setting Up MongoDB

Let me begin by stating this is my first time really dealing with MongoDB in any real sense. I spun up an EC2 instance in AWS (m3.large, CentOS 6) and installed MongoDB using yum and the Mongo repositories.

As the ephemeral storage for my instance had been mounted at /opt, I changed just this one option in the supplied /etc/mongod.conf and restarted mongo (service mongod restart).


I then copied the mysql-audit.json from the MySQL host using SSH:

[percona@mysql-host ~]$ scp -i .ssh/amazon.pem /data/mysql/mysql-audit.json root@

Then I imported this JSON file directly into MongoDB:

[root@ip-10-255-8-15 ~]# mongoimport --db test --collection audit --drop --file /tmp/mysql-audit.json

The above mongoimport command specifies the database in which to import (test) and in which collection (audit). I also specify --drop to drop the collection before importing. This drop is necessary because the Audit Plugin appends to the JSON file, and if we repeated these import steps without the --drop, we would be duplicating data.

If there is enough interest, via the comments below, I will investigate the potential of using the socket functionality of the Audit Plugin to have the events stream directly into mongo.

For now, it’s a wash-rinse-repeat cycle; there is, however, the ability to rotate the JSON audit log after a certain amount of time and import each file on a daily basis.
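A daily import of rotated files could be scripted along these lines. This is a sketch under assumptions: the rotated-file naming pattern (mysql-audit.json.<suffix>) and the path are hypothetical, so adjust the glob to match your rotation setup. Note there is no --drop here, since each rotated file holds distinct events:

```python
import glob

def build_import_commands(pattern="/data/mysql/mysql-audit.json.*",
                          db="test", collection="audit"):
    """Build one mongoimport command (as an argv list) per rotated audit file."""
    commands = []
    for path in sorted(glob.glob(pattern)):
        commands.append(["mongoimport", "--db", db,
                         "--collection", collection, "--file", path])
    return commands

# To actually run them, e.g.:
#   import subprocess
#   for cmd in build_import_commands():
#       subprocess.check_call(cmd)
```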

Making Data Make Sense

Here is a sample “document” (i.e. an audit event) that is created by the Audit Plugin:

{
  "_id" : ObjectId("5571ea51b1e714b8d6d804c8"),
  "msg-type" : "activity",
  "date" : "1433438419388",
  "thread-id" : "10214180",
  "query-id" : "295711011",
  "user" : "activebatchSVC",
  "priv_user" : "activebatchSVC",
  "host" : "ecn.corp",
  "ip" : "",
  "cmd" : "select",
  "objects" : [
    { "db" : "", "name" : "*", "obj_type" : "TABLE" },
    { "db" : "risque", "name" : "markets_source_tfutvol_eab", "obj_type" : "VIEW" },
    { "db" : "historical", "name" : "futureopt", "obj_type" : "TABLE" },
    { "db" : "risque", "name" : "securities_futures_optdef", "obj_type" : "TABLE" },
    { "db" : "risque", "name" : "markets_source_tfutvol_eab", "obj_type" : "VIEW" },
    { "db" : "historical", "name" : "futureopt", "obj_type" : "TABLE" },
    { "db" : "risque", "name" : "securities_futures_optdef", "obj_type" : "TABLE" }
  ],
  "query" : "SELECT far, bar, baz FROM mytable"
}

!! MongoDB BUG !!

Notice that the last field in the document is named “query.” When I attempted some basic aggregate() functions on this field, I received errors about bad syntax. After much frustration, lots of Googling and repeated testing, I came to the conclusion that “query” is a reserved word in MongoDB. There is little to no documentation on this, aside from an almost 3-year-old bug report that simply helped confirm my suspicion.

To work around the above bug issue, let’s rename all of the “query” fields to “qry”:

db.audit.update({}, { $rename: { "query": "qry"} }, false, true);

Now we can begin.

Basic Command Counters

Using any of the “top level” fields in each document, we can run reports (called aggregates in Mongo). So an easy one is to get a list of all unique “commands” and how many times they occurred.

> db.audit.aggregate([ { $group: { "_id": "$cmd", "count": { $sum: 1 } } } ]);
{ "_id" : "Failed Login", "count" : 2 }
{ "_id" : "select", "count" : 458366 }
{ "_id" : "Connect", "count" : 455090 }
{ "_id" : "insert", "count" : 2 }
{ "_id" : "Quit", "count" : 445025 }
{ "_id" : null, "count" : 1 }

Breaking down the command above, we are grouping all values in the “cmd” field and counting them up. The SQL equivalent would be:

SELECT cmd, count(cmd) FROM audit GROUP BY cmd;

User Counts

Let’s get a list and count of all user activities. This will include any of the commands listed in the previous aggregate.

> db.audit.aggregate([ { $group: { "_id": "$user", "count": { $sum: 1 } } } ]);
{ "_id" : "baw", "count" : 1883 }
{ "_id" : "eq_shrd", "count" : 1 }
{ "_id" : "reski", "count" : 3452 }
{ "_id" : "alin", "count" : 1 }
{ "_id" : "oey", "count" : 62 }
{ "_id" : "dule", "count" : 380062 }
{ "_id" : "ashi", "count" : 802 }
{ "_id" : "tech_shrd", "count" : 392464 }

A couple of interesting things come out here. The tech_shrd user performs more ‘activities’ than any other user. Is this expected? Is this normal? Your environment will determine that.

Specific User Activities

Let’s pick a specific user and get their activity counts to make sure they aren’t doing anything weird.

> db.audit.aggregate([
...   { $match: { "user": "tech_shrd" } },
...   { $group: { "_id": "$cmd", "count": { $sum: 1 } } }
... ]);
{ "_id" : "select", "count" : 132970 }
{ "_id" : "Connect", "count" : 133919 }
{ "_id" : "Quit", "count" : 125575 }

The SQL equivalent:

SELECT cmd, count(cmd) FROM audit WHERE user = 'tech_shrd' GROUP BY cmd;

Activities By User

We saw above that there were 2 insert commands. Who ran those?

> db.audit.aggregate([
...   { $match: { "cmd": "insert" } },
...   { $group: { "_id": "$user", "count": { $sum: 1 } } }
... ]);
{ "_id" : "graz", "count" : 2 }

More simply, we could have just done this to see the entire document/record which would include the SQL that the user executed, timestamp, hostname, etc.

> db.audit.find({ "cmd": "insert" });

The SQL equivalents:

SELECT user, count(user) FROM audit WHERE cmd = 'insert' GROUP BY user;
SELECT * FROM audit WHERE cmd = 'insert';

Table Activity

The most complex example I could come up with was trying to find out how many times each table was referenced. In theory, with weeks or even months of audit data, we could decide which tables aren’t needed any longer by the application.

> db.audit.aggregate(
...   { $unwind: "$objects" },
...   { $group: { _id : "$objects.name", count: { $sum: 1 } } },
...   { $sort: { "count": -1 } }
... );
{ "_id" : "*", "count" : 17359 }
{ "_id" : "swaps", "count" : 4392 }
{ "_id" : "futureopt", "count" : 3666 }
...(more)

You’ll notice in the sample document above that “objects” is an array of objects with 1 element for each table/view referenced in the ‘qry’ field. We need to “unwind” this array into single elements before we can count them. If someone knows a better way, please let me know. The Audit Plugin uses “*” to represent a derived table from a sub-SELECT, which has no proper name. We can remove all of these using:

> db.audit.update({ }, { $pull: { "objects": { "name": "*" } } }, false, true);

Audit Plugin Caveat: The ‘objects’ array is not a distinct list of the tables involved. For example, a SELECT statement that self-joins twice would produce 3 identical elements in the ‘objects’ array for that audit record. This may skew results. If anyone knows a cool Mongo trick to remove duplicates, please share in the comments.
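One client-side workaround (not a built-in Mongo trick) is to deduplicate each document’s ‘objects’ array before counting. This Python sketch shows the idea on a plain list of dicts; it assumes you would apply it while re-importing or rewriting the documents:

```python
def dedupe_objects(objects):
    """Remove duplicate table references while preserving order.

    Each element is a dict like {"db": ..., "name": ..., "obj_type": ...};
    two elements are duplicates if all three fields match.
    """
    seen = set()
    unique = []
    for obj in objects:
        key = (obj.get("db"), obj.get("name"), obj.get("obj_type"))
        if key not in seen:
            seen.add(key)
            unique.append(obj)
    return unique

# A self-join would repeat the same table in the array:
objs = [
    {"db": "historical", "name": "futureopt", "obj_type": "TABLE"},
    {"db": "risque", "name": "securities_futures_optdef", "obj_type": "TABLE"},
    {"db": "historical", "name": "futureopt", "obj_type": "TABLE"},
]
print(len(dedupe_objects(objs)))  # 2 unique references remain
```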


For a quick wrap-up: we installed the McAfee Audit Plugin, exported some audit data, set up a MongoDB instance in AWS and imported the audit data. As you can see, the possibilities for what kind of information you can gather are plentiful. Feel free to comment on an aggregation you’d like to see if we were running this type of audit on your system.


The post Auditing MySQL with McAfee and MongoDB appeared first on MySQL Performance Blog.

only want to do healthcheck

Latest Forum Posts - June 8, 2015 - 11:03pm
What sort of configuration do I need in order to know which cluster nodes are up or down?

any configuration sample?

Why does pt-table-checksum report 0 diffs while my master and slave are certainly different?

Latest Forum Posts - June 8, 2015 - 8:02pm

pt-table-checksum: 2.2.14
MySQL: 5.6.19
Binlog: row

Master:

mysql> select * from test.gf;
+------+
| id   |
+------+
|    2 |
|    5 |
|    7 |
+------+

Slave:

mysql> select * from test.gf;
+------+
| id   |
+------+
|    7 |
+------+

So master and slave are certainly different, but running pt-table-checksum shows no difference:

./pt-table-checksum h=localhost,u=root,p=PASSWORD,P=3306
...... other table checking result ......
06-09T10:55:31 0 0 3 1 0 0.003 test.gf

And the percona.checksums table on the slave shows differing crc values:

select * from percona.checksums;
...... other output rows ......
| test | gf | 1 | 0.000625 | NULL | NULL | NULL | 46156dca | 1 | 435a7b4f | 3 | 2015-06-09 10:55:31 |

My question is: since the percona.checksums table on the slave shows that the crc differs between master and slave, why does pt-table-checksum report 0 in the DIFFS column?

VM vs Dedicated

Latest Forum Posts - June 8, 2015 - 9:27am

My organization decided to deploy applications in a VM environment, one VM per application, to control the resource (CPU, RAM, I/O, space) usage of each application. We have several options for the MySQL databases: on the same VM as the application, on a separate VM (one VM per application's databases), or on dedicated servers (all databases on the same servers). I'm looking for the pros and cons of running MySQL in a VM (same vs. separate) versus on a dedicated server. I have no knowledge of resource control within MySQL (if any exists).

Best regards,

MaxScale: A new tool to solve your MySQL scalability problems

Latest MySQL Performance Blog posts - June 8, 2015 - 6:00am

Ever since MySQL replication has existed, people have dreamed of a good solution to automatically split read from write operations, sending the writes to the MySQL master and load balancing the reads over a set of MySQL slaves. While at first the problem seems easy to solve, the reality is far more complex.

First, the tool needs to make sure it correctly parses and analyzes all the forms of SQL that MySQL supports in order to sort writes from reads, something that is not as easy as it seems. Second, it needs to take into account whether a session is in a transaction or not.
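To see why sorting reads from writes is harder than it looks, here is a deliberately naive Python classifier; the comments point out statements it gets wrong, which is exactly why a serious tool embeds a real SQL parser instead:

```python
def naive_is_read(sql):
    """Classify a statement as a read if it starts with SELECT.

    This heuristic fails on many real statements, for example:
      - "SELECT ... FOR UPDATE" takes row locks and must go to the master
      - "SELECT get_lock(...)" or a SELECT calling a writing stored function
      - "SET @x := (SELECT ...)" is not a plain read either
    """
    return sql.lstrip().lower().startswith("select")

assert naive_is_read("SELECT id FROM t") is True
# Misclassified as a read, even though it locks rows on the master:
assert naive_is_read("SELECT id FROM t WHERE id = 1 FOR UPDATE") is True
```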

While in a transaction, the default InnoDB transaction isolation level, Repeatable-read, and the MVCC framework ensure that you’ll get a consistent view for the duration of the transaction. That means all statements executed inside a transaction must run on the master but, when the transaction commits or rolls back, the following select statements on the session can again be load balanced to the slaves, if the session is in autocommit mode of course.

Then, what do you do with sessions that set variables? Do you restrict those sessions to the master, or do you replay the statements on the slaves? If you replay the set variable commands, you need to associate the client connection with a set of MySQL backend connections, made of at least a master and a slave. What about temporary objects, like those created with “create temporary table…”? How do you cope when a slave lags behind or, worse, when replication is broken? Those are just a few of the challenges you face when you want to build a tool to perform read/write splitting.

Over the last few years, a few products have tried to tackle the read/write split challenge. The MySQL_proxy was the first attempt I am aware of at solving this problem, but it ended up with many limitations. ScaleARC does a much better job and is very usable, but it still has some limitations. The latest contender is MaxScale from MariaDB, and this post is the story of my first MaxScale implementation for a customer.

Let me first introduce exactly what MaxScale is. MaxScale is an open source project, developed by MariaDB, that aims to be a modular proxy for MySQL. Most of the functionality in MaxScale is implemented as modules, which include, for example, modules for the client side and server side of the MySQL protocol.

Other families of available modules are routers, monitors and filters. Routers are used to determine where to send a query; read/write splitting is accomplished by the readwritesplit router. The readwritesplit router uses an embedded MySQL server to parse the queries… quite clever and hard to beat in terms of query parsing.

There are other routers available: the readconnrouter is basically a round-robin load balancer with optional weights, the schemarouter is a way to shard your data by schema and the binlog router is useful to manage a large number of slaves (have a look at Booking.com’s Jean-François Gagné’s talk at PLMCE15 to see how it can be used).

Monitors are modules that maintain information about the backend MySQL servers. There are monitors for a replicating setup, for Galera and for NDB cluster. Finally, the filters are modules that can be inserted in the software stack to manipulate the queries and the resultsets. All those modules have well-defined APIs and thus, writing a custom module is rather easy, even for a non-developer like me; basic C skills are needed though. All event handling in MaxScale uses epoll and it supports multiple threads.

Over the last few months I worked with a customer facing a challenging problem. On a PXC cluster, they have more than 30k queries/s and, because of their write pattern and to avoid certification issues, they want the ability to write to a single node and to load balance the reads. The application is not able to do the read/write splitting, so without a tool to do the splitting, only one node can be used for all the traffic. Of course, to make things easy, they use a lot of Java code that sets tons of session variables. Furthermore, for ISO 27001 compliance, they want to be able to log all the queries for security analysis (and also for performance analysis, why not?). So: a high query rate, read/write splitting and full query logging – like I said, a challenging problem.

We experimented with a few solutions. One was a hardware load balancer that failed miserably – the implementation was just too simple, using only regular expressions. Another solution we tried was ScaleArc, but it needed many rules to whitelist the set session variables and to repeat them to multiple servers. ScaleArc could have done the job, but all the rules increase the CPU load and the cost is per CPU. The queries could have been sent to rsyslog and aggregated for analysis.

Finally, the HA implementation is rather minimalist and we had some issues with it. Then, we tried MaxScale. At the time, it was not GA and was (and still is) young. Nevertheless, I wrote a query logging filter module to send all the queries to a Kafka cluster and we gave it a try. Kafka is extremely well suited to record a large flow of queries like that. In fact, at 30k qps, the 3 Kafka nodes are barely moving, with CPU under 5% of one core. Although we encountered some issues (remember, MaxScale is very young), it appeared to be the solution with the best potential, and so we moved forward.

The folks at MariaDB behind MaxScale have been very responsive to the problems we encountered and we finally got to a very usable point, and the test in the pilot environment was successful. The solution has now been deployed in the staging environment and, if all goes well, it will be in production soon. The following figure is a simplified view of the internals of MaxScale as configured for the customer:

The blocks in the figure are nearly all defined in the configuration file. We define a TCP listener using the MySQL protocol (client side) which is linked with a router, either the readwritesplit router or the readconn router.

The first step when routing a query is to assign the backends. This is where the read/write splitting decision is made. Also, as part of the steps required to route a query, 2 filters are called: regexp (optional) and Genlog. The regexp filter may be used to hot patch a query and the Genlog filter is the logging filter I wrote for them. The Genlog filter sends a JSON string containing roughly what can be found in the MySQL general query log, plus the execution time.

Authentication attempts are also logged, but the process is not illustrated in the figure. A key point to note: the authentication information is cached by MaxScale and refreshed upon authentication failure; the refresh process is throttled to avoid overloading the backend servers. The servers are continuously monitored (the interval is adjustable) and the server statuses are used when deciding which backend to assign to a query.

In terms of HA, I wrote a simple Pacemaker resource agent for MaxScale that does a few fancy things, like load balancing with IPTables (I’ll talk about that in a future post). With Pacemaker, we have a full-fledged HA solution with quorum and fencing on which we can rely.

Performance-wise, it is very good – a single core in a virtual environment was able to read/write split and log to Kafka about 10k queries per second. Although MaxScale supports multiple threads, we are still using a single thread per process, simply because it yields a slightly higher throughput, and the custom Pacemaker agent deals with the use of a clone set of MaxScale instances. Remember, we started using MaxScale early, and the beta versions were not dealing gracefully with threads, so we built around multiple single-threaded instances.

So, since a conclusion is needed: MaxScale has proven to be a very useful and flexible tool that makes it possible to build solutions to problems that were very hard to tackle before. In particular, if you need to perform read/write splitting, try MaxScale; it is the best solution for that purpose I have found so far. Keep in touch, I’ll surely write other posts about MaxScale in the near future.

The post MaxScale: A new tool to solve your MySQL scalability problems appeared first on MySQL Performance Blog.

fail joining cluster

Latest Forum Posts - June 5, 2015 - 1:20pm
Hello, I am trying to set up an XtraDB cluster on CentOS 7.

I have successfully bootstrapped the first node. When I try to join a node to the cluster, I cannot see any error on the donor; however, the logs are here:


but on the joiner there are errors: