
Migrating 1TB of data from MySQL 5.1.56 to MariaDB 10.0.26

Latest Forum Posts - August 26, 2016 - 7:34am
So far, I have been unsuccessful in using xtrabackup to do the migration.

Errors seen:

version not supported -- on the MySQL side: the version of xtrabackup used had dropped support for 5.1.56
backup is corrupt -- on the MariaDB side, even though the md5sums matched
failed to start -- from MariaDB after extracting the backup (nothing informative in the MariaDB logs, and no errors reported by xtrabackup)


So, has anyone successfully done such a migration, or have thoughts on which version(s) of xtrabackup to use?

Thanks

Memory usage of InMemory Storage Engine

Latest Forum Posts - August 25, 2016 - 8:06pm
I have tested the MongoDB InMemory storage engine.

Each document is about 50~100 bytes, and there are only 50,000 documents in the collection (just one collection in the mongod instance).
There are 3 indexes: the PK, a TTL index, and a composite 2dsphere index.

I made 4K connections, and the load generator ran 45K updates/second (updates only, which change the TTL and 2dsphere index fields).

$ ./bin/mongod --storageEngine=inMemory --inMemorySizeGB=3

After 3 days of running, the MongoDB instance consumed about 18.5GB of memory, but the db.serverStatus().inMemory.cache."bytes currently in the cache" metric stayed below 3GB.
I don't know whether it's normal for a MongoDB instance to use 15GB just for client connection handler threads.
Even after stopping the load generator and closing all connections, the instance's memory usage doesn't change.

So I tried Google perftools with a rebuilt mongod (--allocator=system):

$ env HEAPPROFILE=/data/profile/hprof HEAP_PROFILE_ALLOCATION_INTERVAL=10737418240 HEAP_PROFILE_INUSE_INTERVAL=1073741824 HEAP_PROFILE_TIME_INTERVAL=600 ./bin/mongod -f ./mongod.conf

After 12 hours of running, the MongoDB instance is using 5.0GB, while the WiredTiger cache is using only 650MB (678,575,506 bytes).
I then generated a memory usage graph with pprof, but stranger still, the pprof graph shows only 777MB of memory usage.
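
For reference, a typical way to render a gperftools heap profile as text (the dump filename under /data/profile/ is hypothetical; pprof numbers the dumps itself):

$ pprof --text ./bin/mongod /data/profile/hprof.0001.heap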

Does mongod allocate memory outside of the tcmalloc API?
Is there anything missing from my profiling procedure?

Regards,
Matt.

auto restart on kill

Latest Forum Posts - August 25, 2016 - 6:48am
Sorry if this is a FAQ - I've googled high and low for an answer and not found anything specific.

I'm running Percona Server 5.6.29 under Ubuntu 14.04. Recently, I had to issue "kill -9" to shut down a mysqld process. I really had no option: I'd issued a normal service shutdown command, which triggered mysqladmin shutdown; the error log showed clearly that the shutdown process had started; after 20 minutes, it had still not shut down fully; and I verified that no data was being written to either the error log or to any files in the database directory.

After the "kill -9", mysqld restarted automatically (and - fortunately - successfully.)

My question is: how can mysqld be prevented from restarting automatically after a "kill -9"? I appreciate that under normal conditions, if mysqld crashes for some reason, it should try to restart (I think). However, in this particular case, I really didn't want it to restart before I had a chance to look through the logs and do other forensic checks and cleanups. If there had been a serious error condition, an automatic restart could have caused even more damage.
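
On Ubuntu 14.04 the automatic restart usually comes from the service manager or a wrapper rather than from mysqld itself. A hedged sketch, assuming an upstart-managed install (check whether /etc/init/mysql.conf exists; packages differ):

# /etc/init/mysql.conf -- hypothetical upstart job excerpt
# the 'respawn' stanza makes upstart restart mysqld whenever the process
# dies, including after kill -9; commenting it out disables auto-restart
#respawn
#respawn limit 2 5

With sysv-init-managed packages, the mysqld_safe wrapper plays the same role: it restarts mysqld after an abnormal exit, so stopping mysqld_safe first prevents the respawn.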


mongodb-consistent-backup

Latest Forum Posts - August 25, 2016 - 6:20am
A bit confused about how to differentiate/use the backups. Environment: 3 data nodes, 3 config servers, and one mongos. The data nodes are on 27018, 27019, and 27020; the mongos on 27017; and the config servers on 27021, 27022, and 27023.

The different backup outputs:
Using the MONGOS port 27017 connection:
dom@db-linux ~/Documents/mongodb/Percona_backup/mongodb_consistent_backup/bin $ ./mongodb-consistent-backup -H localhost -P 27017 -n testcluster -l /home/mongo/mongobackups
[2016-08-25 11:46:12,907] [INFO] [MainProcess] [Backup:run:167] Starting mongodb-consistent-backup version 0.2.1 (git commit hash: 669c7dbcf57161cfe1f94a9cf3f50249ad292832)
[2016-08-25 11:46:12,907] [INFO] [MainProcess] [Backup:run:199] Running backup of localhost:27017 in sharded mode
[2016-08-25 11:46:12,910] [INFO] [MainProcess] [ShardingHandler:get_start_state:32] Began with balancer state: True
[2016-08-25 11:46:12,910] [INFO] [MainProcess] [ShardingHandler:stop_balancer:80] Stopping the balancer and waiting a max of 300 sec
[2016-08-25 11:46:22,921] [INFO] [MainProcess] [ShardingHandler:stop_balancer:90] Balancer is now stopped
[2016-08-25 11:46:23,033] [INFO] [OplogTail-1] [Tail:run:66] Tailing oplog on 127.0.0.1:27019 for changes
[2016-08-25 11:46:23,052] [INFO] [MainProcess] [Mongodumper:run:112] Starting backups in threads using mongodump r3.2.9 (inline gzip: True)
[2016-08-25 11:46:23,054] [INFO] [Mongodump-2] [Mongodump:run:49] Starting mongodump (with oplog) backup of rs0/127.0.0.1:27019
[2016-08-25 11:46:23,056] [INFO] [Mongodump-3] [Mongodump:run:49] Starting mongodump (with oplog) backup of config/127.0.0.1:27021
[2016-08-25 11:46:23,175] [INFO] [Mongodump-2] [Mongodump:run:90] Backup for rs0/127.0.0.1:27019 completed in 0.123189926147 sec with 0 oplog changes captured to: None
[2016-08-25 11:46:23,193] [INFO] [Mongodump-3] [Mongodump:run:90] Backup for config/127.0.0.1:27021 completed in 0.141259908676 sec with 0 oplog changes captured to: None
[2016-08-25 11:46:26,198] [INFO] [MainProcess] [Mongodumper:run:137] All mongodump backups completed
[2016-08-25 11:46:26,198] [INFO] [MainProcess] [Tailer:stop:55] Stopping oplog tailing threads
[2016-08-25 11:46:27,041] [INFO] [OplogTail-1] [Tail:run:119] Done tailing oplog on 127.0.0.1:27019, 0 changes captured to: None
[2016-08-25 11:46:27,198] [INFO] [MainProcess] [Tailer:stop:61] Stopped all oplog threads
[2016-08-25 11:46:27,199] [INFO] [MainProcess] [ShardingHandler:restore_balancer_state:73] Restoring balancer state to: True
[2016-08-25 11:46:27,208] [INFO] [MainProcess] [Resolver:run:51] Resolving oplogs using 8 threads max
[2016-08-25 11:46:27,208] [INFO] [MainProcess] [Resolver:run:63] No oplog changes to resolve for 127.0.0.1:27019
[2016-08-25 11:46:27,208] [INFO] [MainProcess] [Resolver:run:83] No tailed oplog for host 127.0.0.1:27021
[2016-08-25 11:46:27,311] [INFO] [MainProcess] [Resolver:run:95] Done resolving oplogs
[2016-08-25 11:46:27,314] [INFO] [MainProcess] [Archiver:run:86] Archiving backup directories with 4 threads max
[2016-08-25 11:46:27,315] [INFO] [PoolWorker-12] [Archiver:run:55] Archiving directory: /home/mongo/mongobackups/testcluster/20160825_1146/rs0
[2016-08-25 11:46:27,315] [INFO] [PoolWorker-13] [Archiver:run:55] Archiving directory: /home/mongo/mongobackups/testcluster/20160825_1146/config
[2016-08-25 11:46:27,516] [INFO] [MainProcess] [Archiver:run:102] Archiver threads completed
[2016-08-25 11:46:27,516] [INFO] [MainProcess] [Backup:run:317] Backup completed in 14.6180999279 sec
Using the MONGOD port 27018 connection:
dom@db-linux ~/Documents/mongodb/Percona_backup/mongodb_consistent_backup/bin $ ./mongodb-consistent-backup -H localhost -P 27018 -n testcluster -l /home/mongo/mongobackups
[2016-08-25 14:14:51,778] [INFO] [MainProcess] [Backup:run:167] Starting mongodb-consistent-backup version 0.2.1 (git commit hash: 669c7dbcf57161cfe1f94a9cf3f50249ad292832)
[2016-08-25 14:14:51,778] [INFO] [MainProcess] [Backup:run:176] Running backup of localhost:27018 in replset mode
[2016-08-25 14:14:51,798] [INFO] [MainProcess] [Mongodumper:run:53] Found replset rs0, using secondary instance 127.0.0.1:27019 for backup
[2016-08-25 14:14:51,799] [INFO] [MainProcess] [Mongodumper:run:112] Starting backups in threads using mongodump r3.2.9 (inline gzip: True)
[2016-08-25 14:14:51,800] [INFO] [Process-2] [Mongodump:run:49] Starting mongodump (with oplog) backup of rs0/127.0.0.1:27019
[2016-08-25 14:14:52,013] [INFO] [Process-2] [Mongodump:run:90] Backup for rs0/127.0.0.1:27019 completed in 0.213969945908 sec with 0 oplog changes captured to: None
[2016-08-25 14:14:55,016] [INFO] [MainProcess] [Mongodumper:run:137] All mongodump backups completed
[2016-08-25 14:14:55,019] [INFO] [MainProcess] [Archiver:run:86] Archiving backup directories with 1 threads max
[2016-08-25 14:14:55,020] [INFO] [PoolWorker-3] [Archiver:run:55] Archiving directory: /home/mongo/mongobackups/testcluster/20160825_1414/rs0
[2016-08-25 14:14:55,220] [INFO] [MainProcess] [Archiver:run:102] Archiver threads completed
[2016-08-25 14:14:55,220] [INFO] [MainProcess] [Backup:run:317] Backup completed in 3.45086193085 sec
Using a config server connection port:
dom@db-linux ~/Documents/mongodb/Percona_backup/mongodb_consistent_backup/bin $ ./mongodb-consistent-backup -H localhost -P 27021 -n testcluster -l /home/mongo/mongobackups
[2016-08-25 15:39:47,922] [INFO] [MainProcess] [Backup:run:167] Starting mongodb-consistent-backup version 0.2.1 (git commit hash: 669c7dbcf57161cfe1f94a9cf3f50249ad292832)
[2016-08-25 15:39:47,923] [INFO] [MainProcess] [Backup:run:176] Running backup of localhost:27021 in replset mode
[2016-08-25 15:39:47,937] [INFO] [MainProcess] [Mongodumper:run:53] Found replset ConfigSvr, using secondary instance 127.0.0.1:27022 for backup
[2016-08-25 15:39:47,937] [INFO] [MainProcess] [Mongodumper:run:112] Starting backups in threads using mongodump r3.2.9 (inline gzip: True)
[2016-08-25 15:39:47,938] [INFO] [Process-2] [Mongodump:run:49] Starting mongodump (with oplog) backup of ConfigSvr/127.0.0.1:27022
[2016-08-25 15:39:48,164] [INFO] [Process-2] [Mongodump:run:90] Backup for ConfigSvr/127.0.0.1:27022 completed in 0.227024078369 sec with 0 oplog changes captured to: None
[2016-08-25 15:39:51,168] [INFO] [MainProcess] [Mongodumper:run:137] All mongodump backups completed
[2016-08-25 15:39:51,171] [INFO] [MainProcess] [Archiver:run:86] Archiving backup directories with 1 threads max
[2016-08-25 15:39:51,171] [INFO] [PoolWorker-3] [Archiver:run:55] Archiving directory: /home/mongo/mongobackups/testcluster/20160825_1539/ConfigSvr
[2016-08-25 15:39:51,372] [INFO] [MainProcess] [Archiver:run:102] Archiver threads completed
[2016-08-25 15:39:51,372] [INFO] [MainProcess] [Backup:run:317] Backup completed in 3.45891904831 sec
dom@db-linux ~/Documents/mongodb/Percona_backup/mongodb_consistent_backup/bin $


The backup files are in the /home/mongo/mongobackups directory, but the restore notes only mention how to deal with the data. The script lets you back up through the mongod, mongos, or config server port (see above), but which should you use to obtain a consistent backup of the data on the cluster, and how do you restore it?
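
For context, since the tool wraps mongodump (with oplog and inline gzip, per the logs above), restoring one replica set's archive would presumably look something like this sketch (the paths, host, and the archive's inner layout are assumptions):

$ tar -xf /home/mongo/mongobackups/testcluster/20160825_1146/rs0.tar -C /tmp
$ mongorestore --host 127.0.0.1 --port 27019 --gzip --oplogReplay --drop /tmp/rs0/dump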

Best practice for backup/restoring a single database throughout the PXC

Latest Forum Posts - August 24, 2016 - 3:25am
Hi guys.
I am experimenting with PXC, and my plan is to set up a 5-node cluster containing 2 databases for two separate applications. I am trying to figure out how to back up each of the databases, and how to restore only database1 to all nodes in the cluster without downtime for database2. I cannot find this in the documentation. Can you please advise?
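
For the backup half of this, a hedged sketch using innobackupex's per-database filtering (the paths and database name are hypothetical; restoring a single database into a running cluster is the harder part, and typically involves exporting and re-importing tablespaces):

$ innobackupex --databases="database1" /backups/database1/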
Thank you.

XtraDB Docker Cluster Recovery From Replica

Latest Forum Posts - August 24, 2016 - 2:58am
Hey, I'm pretty new to Percona, and I'm trying to figure out how to set up a Percona PXC DB architecture that's HA cross-region in AWS using Docker. Creating nodes outside of the main region will cause latency issues, so I want to create a slave replica cluster in a different region. However, I want to make the failover and recovery process more automatic, to avoid having to set up multiple DBs for multiple services and having to deal with manual failover recovery for all these DBs. Is that even possible, and if so, how?

Percona Server 5.7.14-7 is now available

Latest MySQL Performance Blog posts - August 23, 2016 - 10:57am

Percona announces the GA release of Percona Server 5.7.14-7 on August 23, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.7.14, including all the bug fixes in it, Percona Server 5.7.14-7 is the current GA release in the Percona Server 5.7 series. Percona provides completely open-source and free software. Find release details in the 5.7.14-7 milestone at Launchpad.

New Features:

Bugs Fixed:
  • Fixed potential cardinality 0 issue for TokuDB tables if ANALYZE TABLE finds only deleted rows and no actual logical rows before it times out. Bug fixed #1607300 (#1006, #732).
  • TokuDB database.table.index names longer than 256 characters could cause a server crash if background analyze table status was checked while running. Bug fixed #1005.
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • If DROP DATABASE fails to delete some of the tables in the database, the partially-executed command is logged in the binlog as DROP TABLE t1, t2, ... for the tables for which the drop succeeded. A slave might fail to replicate such a DROP TABLE statement if there are foreign key relationships to any of the dropped tables and the slave has a different schema from the master. Fixed by checking, on the master, whether any of the tables in the database being dropped participate in a foreign key relationship, and failing the DROP DATABASE statement immediately if they do. Bug fixed #1525407 (upstream #79610).
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • For security reasons, ld_preload libraries can now only be loaded from the system directories (/usr/lib64, /usr/lib) and the MySQL installation base directory.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • SHOW GLOBAL STATUS was locking more than the upstream implementation, which made it less suitable for being called at high frequency. Bug fixed #1592290.
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPMs. Bug fixed #1600051.
  • Audit Log Plugin did not transcode queries. Bug fixed #1602986.
  • If the changed page bitmap redo log tracking thread stops for any reason, shutdown would wait a long time for the log tracker thread to quit, which it never does. Bug fixed #1606821.
  • Changed page tracking was initialized too late by InnoDB. Bug fixed #1612574.
  • Fixed stack buffer overflow if --ssl-cipher had more than 4000 characters. Bug fixed #1596845 (upstream #82026).
  • Audit Log Plugin events did not report the default database. Bug fixed #1435099.
  • Canceling the TokuDB Background ANALYZE TABLE job twice or while it was in the queue could lead to server assertion. Bug fixed #1004.
  • Fixed various spelling errors in comments and function names. Bug fixed #728 (Otto Kekäläinen).
  • Implemented a set of fixes to make PerconaFT build and run on the AArch64 (64-bit ARMv8) architecture. Bug fixed #726 (Alexey Kopytov).
Other bugs fixed:

#1542874 (upstream #80296), #1610242, #1604462 (upstream #82283), #1604774 (upstream #82307), #1606782, #1607359, #1607606, #1607607, #1607671, #1609422, #1610858, #1612551, #1613663, #1613986, #1455430, #1455432, #1581195, #998, #1003, and #730.

The release notes for Percona Server 5.7.14-7 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

Percona Server 5.6.32-78.0 is now available

Latest MySQL Performance Blog posts - August 22, 2016 - 3:44pm

Percona announces the release of Percona Server 5.6.32-78.0 on August 22nd, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.6.32, including all the bug fixes in it, Percona Server 5.6.32-78.0 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free – this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release are available in the 5.6.32-78.0 milestone on Launchpad.

New Features:

Bugs Fixed:
  • Fixed potential cardinality 0 issue for TokuDB tables if ANALYZE TABLE finds only deleted rows and no actual logical rows before it times out. Bug fixed #1607300 (#1006, #732).
  • TokuDB database.table.index names longer than 256 characters could cause a server crash if the background analyze table status was checked while running. Bug fixed #1005.
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • If DROP DATABASE fails to delete some of the tables in the database, the partially-executed command is logged in the binlog as DROP TABLE t1, t2, ... for the tables for which the drop succeeded. A slave might fail to replicate such a DROP TABLE statement if there are foreign key relationships to any of the dropped tables and the slave has a different schema from the master. Fixed by checking, on the master, whether any of the tables in the database being dropped participate in a foreign key relationship, and failing the DROP DATABASE statement immediately if they do. Bug fixed #1525407 (upstream #79610).
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • For security reasons, ld_preload libraries can now only be loaded from the system directories (/usr/lib64, /usr/lib) and the MySQL installation base directory.
  • Percona Server 5.6 could not be built with the -DMYSQL_MAINTAINER_MODE=ON option. Bug fixed #1590454.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPMs. Bug fixed #1600051.
  • Audit Log Plugin did not transcode queries. Bug fixed #1602986.
  • LeakSanitizer-enabled build failed to bootstrap server for MTR. Bug fixed #1603978 (upstream #81674).
  • Fixed MYSQL_SERVER_PUBLIC_KEY connection option memory leak. Bug fixed #1604419.
  • The fix for bug #1341067 added a call to free some of the heap memory allocated by OpenSSL. This is not safe for repeated calls if OpenSSL is linked twice through different libraries and each is trying to free the same. Bug fixed #1604676.
  • If the changed page bitmap redo log tracking thread stops for any reason, shutdown would wait a long time for the log tracker thread to quit, which it never does. Bug fixed #1606821.
  • Audit Log Plugin events did not report the default database. Bug fixed #1435099.
  • Canceling the TokuDB Background ANALYZE TABLE job twice or while it was in the queue could lead to server assertion. Bug fixed #1004.
  • Fixed various spelling errors in comments and function names. Bug fixed #728 (Otto Kekäläinen).
  • Implemented a set of fixes to make PerconaFT build and run on the AArch64 (64-bit ARMv8) architecture. Bug fixed #726 (Alexey Kopytov).

Other bugs fixed: #1603073, #1604323, #1604364, #1604462, #1604774, #1606782, #1607224, #1607359, #1607606, #1607607, #1607671, #1608385, #1608437, #1608845, #1609422, #1610858, #1612084, #1612551, #1455430, #1455432, #1610242, #998, #1003, #729, and #730.

Release notes for Percona Server 5.6.32-78.0 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

Query rewrite plugin: scalability fix in MySQL 5.7.14

Latest MySQL Performance Blog posts - August 22, 2016 - 2:46pm

In this post, we’ll look at a scalability fix for issues the query rewrite plugin had on performance.

Several months ago, Vadim blogged about the impact of a query rewrite plugin on performance. We decided to re-evaluate the latest release of 5.7 (5.7.14), which includes fixes for this issue.

I reran tests for MySQL 5.7.13 and 5.7.14 using the same setup and the same test: sysbench OLTP_RO without and with the query rewrite plugin enabled.
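
For reference, the Rewriter plugin is enabled with the install script bundled with MySQL, which creates the query_rewrite schema and installs the plugin (a sketch; the script's path varies by distribution):

$ mysql -uroot -p < /usr/share/mysql/install_rewriter.sql
SHOW GLOBAL VARIABLES LIKE 'rewriter_enabled';

rewriter_enabled should report ON once the plugin is loaded.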

MySQL 5.7.14 performs much better, with almost no overhead. Let’s check PMP for these runs:

MySQL 5.7.13

206 __lll_lock_wait(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),plugin_unlock_list,mysql_audit_release,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
152 __lll_lock_wait(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),plugin_foreach_with_mask,mysql_audit_acquire_plugins,mysql_audit_notify,invoke_pre_parse_rewrite_plugins,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
97 __lll_lock_wait(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),plugin_lock,acquire_plugins,plugin_foreach_with_mask,mysql_audit_acquire_plugins,mysql_audit_notify,invoke_pre_parse_rewrite_plugins,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
34 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
18 send(libpthread.so.0),vio_write,net_write_packet,net_flush,net_send_eof,THD::send_statement_status,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
9 recv(libpthread.so.0),vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
8 poll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)

MySQL 5.7.14

309 send(libpthread.so.0),vio_write,net_write_packet,net_flush,net_send_eof,THD::send_statement_status,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
43 recv(libpthread.so.0),vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
34 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
15 poll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
7 send(libpthread.so.0),vio_write,net_write_packet,net_flush,net_send_ok,Protocol_classic::send_ok,THD::send_statement_status,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
7 pthread_cond_wait,os_event::wait_low,srv_worker_thread,start_thread(libpthread.so.0),clone(libc.so.6)
7 pthread_cond_wait,os_event::wait_low,buf_flush_page_cleaner_worker,start_thread(libpthread.so.0),clone(libc.so.6)

No sign of extra locks with the plugin API in PMP for MySQL 5.7.14. Good job!

Also, note that the fix for this issue should help improve performance for any other plugins based on the audit API.

Very poor read performance after moving to Galera cluster

Latest Forum Posts - August 22, 2016 - 2:29pm
Hi,

We have recently moved to a Galera Cluster and have noticed that query performance has dropped rather dramatically. I have to admit that I am extremely new to Galera, and hence I have not grokked all the nuances of the parameters one can pass to it. My setup is rather basic, and I have, for instance, read that perhaps I should look into the gcache size.

So, having read this, I realise that there must be more of which I am embarrassingly unaware, and I would very much appreciate any tips on how to improve my performance.


Cheers,


K

Top Most Overlooked MySQL Performance Optimizations: Q & A

Latest MySQL Performance Blog posts - August 19, 2016 - 4:34pm

Thank you for attending my 22nd July 2016 webinar titled “Top Most Overlooked MySQL Performance Optimizations“. In this blog, I will provide answers to the Q & A for that webinar.

For hardware, which disk raid level do you suggest? Is raid5 suggested performance-wise and data-integrity-wise?
RAID 5 comes with high overhead, as each write turns into a sequence of four physical I/O operations: two reads and two writes. We know that RAID 5 has a write penalty, which can affect performance on spinning disks. In most cases, we advise using alternative RAID levels. Use RAID 5 when disk capacity is more important than performance (e.g., archive databases that aren't used often). Since write performance isn't a problem with SSDs, but capacity is expensive, RAID 5 can help by wasting less disk space.

Regarding collecting table statistics, do you have any suggestions for analyzing large tables (over 300GB) since we had issues with MySQL detecting the wrong cardinality?
The MySQL optimizer makes decisions about the execution plan (EXPLAIN) based on statistics. These are re-estimated automatically, and can be re-estimated explicitly by running the ANALYZE TABLE statement for the table, or the OPTIMIZE TABLE statement for InnoDB tables (which rebuilds the table and then performs an ANALYZE on it).

When the MySQL optimizer is not picking the right index in EXPLAIN, it could be caused by outdated or wrong statistics (optimizer bugs aside). So, when you optimize the table, you rebuild it so the data is stored more compactly (assuming it changed a lot in the past), and then re-estimate statistics based on a random sample of pages checked in the table. As a result, you get statistics that better reflect the data you have at the moment, which allows the optimizer to choose a better plan. When an explicit hint is added, you reduce the possible choices for the optimizer, and it can use a good enough plan even with wrong statistics.

If you use versions 5.6.x or 5.7.x with InnoDB tables, there is a way to store/fix statistics while the plans are good: using Persistent Optimizer Statistics prevents them from changing automatically. It's recommended that you run ANALYZE TABLE to calculate statistics (if really needed) during off-peak time, and make sure the table in question is not in use. Check this blogpost too.
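
For example, a minimal SQL sketch with persistent statistics (the table name is hypothetical):

SET GLOBAL innodb_stats_auto_recalc = OFF;  -- stop automatic re-estimation globally
ALTER TABLE big_table STATS_PERSISTENT=1, STATS_AUTO_RECALC=0;  -- or control it per table
ANALYZE TABLE big_table;  -- recalculate once, during off-peak time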

Regarding the buffer pool, when do you think using multiple buffer pool instances makes sense?
Multiple InnoDB buffer pools were introduced in MySQL 5.5, with a default value of 1; the default in MySQL 5.6 is 8. Setting innodb_buffer_pool_instances above 1 is useful in highly concurrent workloads, as it may reduce contention on the global mutexes. innodb_buffer_pool_instances helps improve scalability on multi-core machines: with multiple buffer pools, access to the buffer pool is split across all instances, so no single mutex controls the access pattern.

innodb_buffer_pool_instances only takes effect when innodb_buffer_pool_size is at least 1GB, and the total specified innodb_buffer_pool_size is divided among all the buffer pool instances. Further, innodb_buffer_pool_instances is not a dynamic option, so changing it requires a server restart.
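
A minimal my.cnf sketch (the values are illustrative, not a recommendation):

[mysqld]
innodb_buffer_pool_size = 16G     # total size, split across all instances
innodb_buffer_pool_instances = 8  # 2G per instance here; requires a restart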

What do you mean by "PK is appended to secondary index"?
In InnoDB, secondary indexes are stored along with their corresponding primary key values. InnoDB uses this primary key value to find the row in the clustered index. So, primary keys are implicitly appended to secondary keys.
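
A quick illustration with a hypothetical table: because each secondary index entry already carries the primary key value, the query below is satisfied by the secondary index alone (a covering index read):

CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT,
  KEY idx_customer (customer_id)
) ENGINE=InnoDB;

-- idx_customer implicitly stores id, so this query never touches the clustered index:
SELECT id FROM orders WHERE customer_id = 42;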

About duplicate keys: if I have a UNIQUE KEY on two columns, is it OK to also set a key on each of these columns? Or should I keep only the unique key on the columns and get rid of the regular key on each column?
As I mentioned during the talk, for a composite index the leftmost prefix is used. For example, if you have a UNIQUE INDEX on columns A and B as (A,B), that index is not used for lookups by the query below:

SELECT * FROM test WHERE B='xxx';

For that query, you need a separate index on the B column.
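
For instance (idx_b is a hypothetical index name):

ALTER TABLE test ADD INDEX idx_b (B);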

On MyISAM settings: don't the mysql and information_schema schemas require MyISAM? If so, do any settings beyond the defaults need to be changed?
performance_schema uses the PERFORMANCE_SCHEMA storage engine, so only the mysql system database tables use the MyISAM engine. The mysql system database is not used much, and the default settings for the MyISAM engine are usually fine for it.

Will functions make my app slower compared to plain queries?
I'm not sure how you're comparing "queries" versus "stored functions." Functions also need to be parsed and optimized, similar to queries, and they might be slower compared to well-coded SQL, even allowing for the overhead of copying the resulting data set back to the client. Typically, functions contain many SQL queries. The trade-off is that this increases load on the database server, because more of the work is done on the server side and less on the client (application) side.

Will foreign keys make my fetches slower?
MySQL enforces referential integrity (which ensures data consistency between related tables) via foreign keys for the InnoDB storage engine. There is some overhead on INSERT/UPDATE/DELETE for a foreign key column, which has to check whether the value exists in the related column of the other table; but this is an index lookup, so the cost shouldn't be high. However, locking overhead comes into play as well; this blogpost from our CEO is informative on the topic. That especially affects writes, but I don't think foreign keys make SELECT fetches slower, as it's an index lookup.

Can a large buffer pool size have a negative impact on performance? Say, a pool size of about 62GB?
The InnoDB buffer pool is by far the most important option for InnoDB performance, as it's the main cache for data and indexes, and it must be set correctly. Setting it large enough (i.e., larger than your dataset) shouldn't cause any problems as long as you leave enough memory for OS needs and for MySQL buffers (e.g., sort buffer, join buffer, temporary tables, etc.).

62GB isn't necessarily a big InnoDB buffer pool. It depends on how much memory your MySQL server has, and on the size of your total InnoDB dataset. A good rule of thumb is to set the InnoDB buffer pool as large as possible while still leaving enough memory for MySQL buffers and the OS.

Do you find duplicate or redundant indexes by looking at information_schema.key_column_usage directly?
The key_column_usage view provides information about key column constraints. It doesn't provide information about duplicate or redundant indexes.

Can you find candidate missing indexes by looking at the slow query log?
Yes. As I mentioned, you can log queries that don't use indexes by enabling log_queries_not_using_indexes; they are written to the slow query log. You can also enable the user_statistics feature, which adds several information_schema tables, and find unused indexes with its help. pt-index-usage is yet another tool, from Percona Toolkit, for this purpose. Also, check this blogpost on the topic.

How do you find unused indexes? They also have an impact on performance.
Unused indexes can be found with the help of the pt-index-usage tool from Percona Toolkit, as mentioned above. If you are using Percona Server, you can also use the User Statistics feature. Check this blogpost from my colleague, which shows another technique for finding unused indexes.
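
A typical pt-index-usage invocation (the slow log path is hypothetical):

$ pt-index-usage /var/lib/mysql/slow.log --host localhost

The tool EXPLAINs the logged queries against the given server and prints DROP KEY suggestions for indexes none of them used, so consider pointing it at a non-production host.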

As far as I understand, MIXED will automatically use ROW for non-deterministic and STATEMENT for deterministic queries. I’ve been using it for years now without any problems. So why this recommendation of ROW?

In MIXED mode, MySQL uses statement-based replication for most queries, switching to row-based replication only when statement-based replication would cause an inconsistency. We recommend ROW-based logging because it's efficient and performs better, as it requires fewer row locks. However, RBR can generate more data if a DML query affects many rows and a significant amount of data needs to be written to the binary log (you can configure the binlog_row_image parameter to control the amount of logging). Also, make sure you have good network bandwidth between the master and slave(s) for RBR, as it needs to send more data to slaves. Another important thing for getting the best performance with ROW-based replication is making sure all your database tables have a primary key or unique key (because of this bug http://bugs.mysql.com/bug.php?id=53375).
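
A my.cnf sketch of those recommendations (values illustrative):

[mysqld]
binlog_format = ROW
binlog_row_image = minimal  # log only the key columns plus changed columns, cutting binlog volume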

Can you give a brief overview of sharding…The pros and cons also.
With sharding, data is split across multiple databases, with each shard storing a subset of the data. Sharding is useful for scaling writes when you have a huge dataset and a single server can't handle the volume of writes.

Performance and throughput can be better with sharding. On the other hand, it requires a lot of development and administration effort: the application needs to be aware of the shards and keep track of which data is stored in which shard. You can use the MySQL Fabric framework to manage farms of MySQL servers; check the manual for details.

Why not mixed replication mode instead of row-based replication?
As I mentioned above, MIXED uses the STATEMENT-based format by default and converts to the ROW-based format for non-deterministic queries. But the ROW-based format is recommended because there can still be cases where MySQL fails to detect non-deterministic query behavior and replicates it in the STATEMENT-based format.

Can you specify a few variables which could reduce slave lag?
Because of the single-threaded nature of MySQL replication (until MySQL 5.6), there is always a chance that a MySQL slave can lag behind the master. I would suggest considering the parameters below to avoid slave lag (a my.cnf sketch of these settings follows the list):

  • innodb_flush_log_at_trx_commit <> 1: set it to either 2 or 0; however, this could cost you up to one second of data loss in case of a crash.
  • innodb_flush_method = O_DIRECT: on Unix-like operating systems, O_DIRECT is recommended to avoid double buffering. If your InnoDB data and log files are located on a SAN, then O_DIRECT is probably not a good choice.
  • log_bin = 0: disable binary logging (if enabled) to minimize extra disk I/O.
  • sync_binlog = 0: disable sync_binlog.
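
Put together as a my.cnf sketch (illustrative; weigh the durability trade-offs above first):

[mysqld]
innodb_flush_log_at_trx_commit = 2  # or 0; up to ~1 second of data loss on crash
innodb_flush_method = O_DIRECT      # avoids double buffering; reconsider on SAN
sync_binlog = 0                     # don't fsync the binlog on every commit
# and leave log-bin unset if the slave doesn't need its own binary log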

Those parameters will definitely help minimize slave lag. However, along with that, make sure the slave(s') hardware is as strong as the master's, make sure your read queries are fast enough, don't overload the slave too much, and distribute read traffic evenly between slave(s). You should also have the same table definitions on the slave(s) as on the master (e.g., the master's indexes must exist on the slave(s') tables too). Last but not least, I wrote a blogpost on how to diagnose and cure replication lag; it might be useful for further reading.

pt-table-checksum/sync issue with MySQL driver

Latest Forum Posts - August 19, 2016 - 2:32pm
I am very new to the Percona Toolkit and have been trying to test out pt-table-sync and pt-table-checksum, but I can't seem to get them to work. I have a 2-node test cluster. I deleted a table on node 2 (the slave) and wanted to see if checksum would pick it up. I ran the following command on the master:

pt-table-checksum --replicate=percona.checksums --replicate-check-only --ignore-databases mysql h-localhost,u=user,p=password

I got the following error:
install_driver(mysql) failed: Attempt to reload DBD/mysql.pm aborted.
Compilation failed in require at (eval 23) line 3.

at /usr/bin/pt-table-checksum line 1581

When I run the same command on node 2, it runs just fine, but since it is the slave it doesn't see a problem.

So I checked both MySQL drivers, and they are the same version:
perl-DBD-MySQL-4.013-3.el6.x86_64

So I am a little lost as to what the issue could be.
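
One way to confirm which DBD::mysql Perl actually loads, and from where (assuming the perl in PATH is the one the toolkit uses):

$ perl -MDBD::mysql -e 'print $DBD::mysql::VERSION, "\n", $INC{"DBD/mysql.pm"}, "\n"'

The "Attempt to reload DBD/mysql.pm aborted" error can indicate two conflicting installs of the driver (e.g., one from RPM and one from CPAN) rather than a version mismatch.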

Please note I am running an older version of MySQL on node 2, but I don't think that would make a difference, would it?

5.6.29-76.2 (Node1) 5.6.22-71.0 (Node2)

Percona Server 5.5.51-38.1 is now available

Latest MySQL Performance Blog posts - August 19, 2016 - 8:58am

Percona announces the release of Percona Server 5.5.51-38.1 on August 19, 2016. Based on MySQL 5.5.51, including all the bug fixes in it, Percona Server 5.5.51-38.1 is now the current stable release in the 5.5 series.

Percona Server is open-source and free. You can find release details in the 5.5.51-38.1 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • If DROP DATABASE fails to delete some of the tables in the database, the partially-executed command is logged in the binlog as DROP TABLE t1, t2, ... for the tables for which the drop succeeded. A slave might fail to replicate such a DROP TABLE statement if there are foreign key relationships to any of the dropped tables and the slave has a different schema from the master. Fixed by checking, on the master, whether any of the tables in the database being dropped participate in a foreign key relationship, and failing the DROP DATABASE statement immediately if they do. Bug fixed #1525407 (upstream #79610).
  • Percona Server 5.5 could not be built with the -DMYSQL_MAINTAINER_MODE=ON option. Bug fixed #1590454.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPM packages. Bug fixed #1600051.
  • The fix for bug #1341067 added a call to free some of the heap memory allocated by OpenSSL. This was not safe for repeated calls if OpenSSL is linked twice through different libraries and each is trying to free the same. Bug fixed #1604676.
  • If the changed page bitmap redo log tracking thread stops for any reason, shutdown would wait a long time for the log tracker thread to quit, which it never does. Bug fixed #1606821.
  • Performing a slow InnoDB shutdown (innodb_fast_shutdown set to 0) could result in an incomplete purge if a separate purge thread is running (which is the default in Percona Server). Bug fixed #1609364.
Other bugs fixed:

#1515591 (upstream #79249), #1612551, #1609523, #756387, #1097870, #1603073, #1606478, #1606572, #1606782, #1607224, #1607359, #1607606, #1607607, #1607671, #1608385, #1608424, #1608437, #1608515, #1608845, #1609422, #1610858, #1612084, #1612118, and #1613641.

Find the release notes for Percona Server 5.5.51-38.1 in our online documentation. Report bugs on the launchpad bug tracker.

All nodes down in cluster

Latest Forum Posts - August 19, 2016 - 5:47am
Hello,

Today, I had an issue with my cluster: all of the nodes shut down at the same time. I have a 3-node cluster. In the logs, I spotted this wsrep-initiated shutdown:
2016-08-19 10:01:14 1830 [Warning] WSREP: certification interval for trx source: b9e7c19f-611b-11e6-a681-f74f157327be version: 3 local: 1 state: CERTIFYING flags: 1 conn_id: 71297 trx_id: 32216209837 seqnos (l: 253333669, g: 6904092724, s: 6904073643, d: -1, ts: 1451339341459664) exceeds the limit of 16384
2016-08-19 10:01:14 1830 [Warning] WSREP: certification interval for trx source: b9e7c19f-611b-11e6-a681-f74f157327be version: 3 local: 1 state: CERTIFYING flags: 65 conn_id: 145554 trx_id: -1 seqnos (l: 253333670, g: 6904092725, s: 6904074034, d: -1, ts: 1451344454512438) exceeds the limit of 16384
2016-08-19 10:01:14 1830 [ERROR] WSREP: Certification failed for TO isolated action: source: b9e7c19f-611b-11e6-a681-f74f157327be version: 3 local: 1 state: CERTIFYING flags: 65 conn_id: 145554 trx_id: -1 seqnos (l: 253333670, g: 6904092725, s: 6904074034, d: -1, ts: 1451344454512438)
2016-08-19 10:01:14 1830 [Note] WSREP: Closing send monitor...
2016-08-19 10:01:14 1830 [Note] WSREP: Closed send monitor.
2016-08-19 10:01:14 1830 [Note] WSREP: gcomm: terminating thread
2016-08-19 10:01:14 1830 [Note] WSREP: gcomm: joining thread
2016-08-19 10:01:14 1830 [Note] WSREP: gcomm: closing backend
2016-08-19 10:01:15 1830 [Note] WSREP: view(view_id(NON_PRIM,9e529808,214) memb { b9e7c19f,0 } joined { } left { } partitioned { 9e529808,0 })
2016-08-19 10:01:15 1830 [Note] WSREP: view((empty))
2016-08-19 10:01:15 1830 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-08-19 10:01:15 1830 [Note] WSREP: gcomm: closed
2016-08-19 10:01:15 1830 [Note] WSREP: Flow-control interval: [512, 512]
2016-08-19 10:01:15 1830 [Note] WSREP: Received NON-PRIMARY.
2016-08-19 10:01:15 1830 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6904092891)
2016-08-19 10:01:15 1830 [Note] WSREP: Received self-leave message.
2016-08-19 10:01:15 1830 [Note] WSREP: Flow-control interval: [0, 0]
2016-08-19 10:01:15 1830 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2016-08-19 10:01:15 1830 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 6904092891)
2016-08-19 10:01:15 1830 [Note] WSREP: RECV thread exiting 0: Success
2016-08-19 10:01:15 1830 [Note] WSREP: recv_thread() joined.
2016-08-19 10:01:15 1830 [Note] WSREP: Closing replication queue.
2016-08-19 10:01:15 1830 [Note] WSREP: Closing slave action queue.
2016-08-19 10:01:15 1830 [Note] WSREP: /usr/sbin/mysqld: Terminated.

The same situation occurred on the other nodes. MySQL tried to restart, but I think it couldn't, because it needed a bootstrapped node.

What could be the reason for this problem? The version actually in use is 5.6.30-25.16-1.xenial.

Best regards,

Replicate same schema from two masters and insert into slave

Latest Forum Posts - August 19, 2016 - 12:39am
Hello Support ,

We are using Percona Server 5.7.14 and we want to achieve something like this:

1. Master A has schema : Sakila
2. Master B has schema : Sakila

Now there is a slave C, which has schemas Sakila_1 and Sakila_2.
We want to use two channels: one for Master A and one for Master B.

Please let us know what config changes we need to make on Slave C (specifically regarding replicate_rewrite_db), and how we should set up the multiple replication channels.

[mysqld]

masterA.replicate-rewrite-db = sakila -> sakila_1

masterB.replicate-rewrite-db=sakila -> sakila_2
...


Please let us know your thoughts on this.
I was going through this blog and thought it may be possible with Percona Server too: https://mariadb.com/blog/multisource...name-conflicts
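
For the channel half, the stock MySQL 5.7 multi-source syntax looks like the sketch below (host names and credentials are hypothetical). Note that in stock MySQL 5.7, replication filters such as replicate-rewrite-db are global rather than per-channel, so verify whether your server version accepts the per-channel syntax quoted above before relying on it.

CHANGE MASTER TO MASTER_HOST='masterA.example.com', MASTER_USER='repl',
  MASTER_PASSWORD='...', MASTER_AUTO_POSITION=1 FOR CHANNEL 'masterA';
CHANGE MASTER TO MASTER_HOST='masterB.example.com', MASTER_USER='repl',
  MASTER_PASSWORD='...', MASTER_AUTO_POSITION=1 FOR CHANNEL 'masterB';
START SLAVE FOR CHANNEL 'masterA';
START SLAVE FOR CHANNEL 'masterB';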

Any inputs would be appreciated.


Many Thanks,
Mahesh

ProxySQL 1.2.1 GA Release

Latest MySQL Performance Blog posts - August 18, 2016 - 10:55am

The GA release of ProxySQL 1.2.1 is available. You can get it from https://github.com/sysown/proxysql/releases. There are also Docker images for Release 1.2.1: https://hub.docker.com/r/percona/proxysql/.

This post is published with the approval of René Cannaò (the creator of ProxySQL). René is busy implementing more new ProxySQL features, so I decided to make this announcement!

Release highlights:
  • Support for backend SSL connections
  • Support for encrypted passwords (the mysql_users table now supports both plain text and hashed passwords, in the same format as mysql.user.password)
  • Improved monitoring module
  • Better integration with Percona XtraDB Cluster
    • New feature: the Scheduler, which allows extending ProxySQL with external scripts

The last point is especially important in conjunction with our recent Percona XtraDB Cluster 5.7 RC1 release. When we ship Percona XtraDB Cluster 5.7 GA, we plan to make ProxySQL the default proxy solution for Percona XtraDB Cluster. ProxySQL is aware of the cluster and node status, and can direct traffic appropriately.

ProxySQL 1.2.1 comes with additional scripts to support this:

ProxySQL 1.2.1 and these scripts are compatible with existing Percona XtraDB Cluster 5.6 GA releases.

ProxySQL 1.2.1 is a solid release, currently used by many demanding high-performance production workloads – it is already battle tested! Give it a try if you are looking for a proxy solution.

ProxySQL is available under the open-source GPLv3 license, which allows unlimited usage in production. There are no plans to change the license!

Best Practices to backup and restore for Percona XtraDB Cluster

Latest Forum Posts - August 18, 2016 - 2:10am
Hi,

I am a newbie to Percona XtraDB Cluster. I have created a 3-node cluster by referring to your manual.
Node 1 dedicated Machine = Donor
Node 2 dedicated Machine = Joiner 1
Node 3 dedicated Machine = Joiner 2
Node 4 dedicated Machine = Joiner 3

I am using the following SST method.
+------------------+---------------+
| wsrep_sst_method | xtrabackup-v2 |
+------------------+---------------+

My question is how I should perform the backups using xtrabackup (I am not using innobackupex).
Do I need to run xtrabackup on all nodes (Node 1 through Node 4), making a total of 4 backups?
Or
Do I need to run xtrabackup only on the donor, Node 1?
Or
Can I run xtrabackup on any node, regardless of whether it is a donor or joiner?

For restore: once I have the backup folder/backup image, can I restore on any node and have the data replicated to all nodes,
or do I need to perform the restore on each node individually before starting the donor and joiners?
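
For reference, a minimal xtrabackup 2.4 backup-and-prepare sketch (the target directory and credentials are hypothetical):

$ xtrabackup --backup --user=backup --password=... --target-dir=/data/backups/base
$ xtrabackup --prepare --target-dir=/data/backups/base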

Thanks

PXC Install Fails -- Installed Dependent Packages Not Recognized

Latest Forum Posts - August 17, 2016 - 1:36pm
Unable to use the Percona repo at www.percona.com, I downloaded the PXC tarball Percona-XtraDB-Cluster-5.6.30-25.16-raa929cb-el7-x86_64-bundle.tar, extracted the RPMs, and individually installed the client, galera-3, cluster shared, libev.so, and xtrabackup packages. However, when I attempted to install the PXC server package, I got this error:

[root@newcst-e-db01 download]# yum install Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64.rpm
Loaded plugins: product-id, search-disabled-repos, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Examining Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64.rpm: 1:Percona-XtraDB-Cluster-server-7.56-5.6.30-25.16.1.el.x86_64
Marking Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package Percona-XtraDB-Cluster-server-56.x86_64 1:5.6.30-25.16.1.el7 will be installed
--> Processing Dependency: percona-xtrabackup >= 2.2.5 for package: 1:Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64
--> Processing Dependency: lsof for package: 1:Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64
--> Running transaction check
---> Package Percona-XtraDB-Cluster-server-56.x86_64 1:5.6.30-25.16.1.el7 will be installed
--> Processing Dependency: percona-xtrabackup >= 2.2.5 for package: 1:Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64
---> Package lsof.x86_64 0:4.87-4.el7 will be installed
--> Finished Dependency Resolution
Error: Package: 1:Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64 (/Percona-XtraDB-Cluster-server-56-5.6.30-25.16.1.el7.x86_64)
Requires: percona-xtrabackup >= 2.2.5
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

-- but xtrabackup is installed:

[root@newcst-e-db01 download]# yum list "percona*backup*"
Loaded plugins: product-id, search-disabled-repos, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Installed Packages
percona-xtrabackup-24.x86_64 2.4.3-1.el7 @/percona-xtrabackup-24-2.4.3-1.el7.x86_64

Note that xtrabackup 2.4.3 is listed -- I had installed 2.4.4, but when the PXC server install error occurred, I attempted to rectify it by dropping 2.4.4 and loading 2.4.3. It didn't matter.

Has anyone else experienced this problem? Is there a workaround? I'd really like to get the software installed.

[root@newcst-e-db01 download]# uname -a
Linux newcst-e-db01 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@newcst-e-db01 download]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Improve TokuDB/PerconaFT fragmented data file performance

Latest MySQL Performance Blog posts - August 17, 2016 - 10:05am

In this blog post, we’ll discuss how to improve TokuDB and PerconaFT fragmented data file performance.

Through our internal benchmarking and some user reports, we have found that with long-term heavy write use, TokuDB/PerconaFT performance can degrade significantly on large data files. Using smaller node sizes (which is one of our performance tuning recommendations when you have faster storage) makes the problem worse. The problem manifests as low CPU utilization, a drop in overall TPS, and high client response times during prolonged checkpointing.

This post explains a little about how we structure PerconaFT dictionary files and where the current implementation breaks down. Hopefully, it explains the nature of the issue and how our solution helps address it. It also provides some contrived benchmarks that demonstrate the solution.

PerconaFT map file disk format

NOTE: This post uses the terms index, data file, and dictionary somewhat interchangeably. We will use the PerconaFT term "dictionary" to refer specifically to a PerconaFT key/value data file.

PerconaFT stores every dictionary in its own data file on disk. TokuDB stores each index in a PerconaFT dictionary, plus one additional dictionary per table for some metadata. For example, if you have one TokuDB table with two secondary indices, you would have four data files or dictionaries: one small metadata dictionary for the table, one dictionary for the primary key/index, and one for each secondary index.

Each dictionary file has three major parts:

  • Two headers (yes, two) made up of various bits of metadata: file versions, a checkpoint logical sequence number (CLSN), the offset of this header's block translation table, etc.
  • Two block translation tables (BTT), one per header, that map block numbers (BNs) to the physical offsets and sizes of the data blocks within the file.
  • Data blocks and holes (unused space). Unlike InnoDB, PerconaFT data blocks (nodes) are variably sized: anywhere from a few bytes for an empty internal node up to the block size defined when the tree was created (4MB by default, before compression), depending on the amount of data within that node.

Each dictionary file contains two versions of the header stored on disk, only one of which is valid at any given point in time. Since the header structure has a fixed size, we always know their locations: the first is at offset zero, and the other immediately follows it. The currently valid header is the one with the later/larger CLSN.

We write the header and the BTT to disk during a checkpoint or when a dictionary is closed (the only times we do so). The new header overwrites the older header (the one with the older CLSN) on disk. From that moment onward, any disk space used by the previous version of the dictionary (the whole thing, not just the header) that is not also used by the latest version is considered immediately free.

There is much more magic to how PerconaFT does checkpointing and consistency, but that is out of the scope of this post. Maybe a later post on the PerconaFT sharp checkpoint can dive into it.

The block allocator

The block allocator is the algorithm and container that manages the list of known used blocks and unused holes within an open dictionary file. When a node gets written, it is the block allocator's responsibility to find a suitable location in the file for the node's data. The data is always placed into a new block; it never overwrites an existing block (except for block space reclaimed from blocks that were removed or moved and recorded during the last checkpoint). Conversely, when a node gets destroyed, it is the block allocator's responsibility to release the used space and turn the old block into a hole. That hole must also be merged with any adjacent holes, so that we record one large hole rather than a series of consecutive smaller ones.

Fragmentation and large files

The current implementation of the PerconaFT block allocator maintains a simple in-memory array of used blocks for each open dictionary, ordered ascending by their offset in the file. The holes between blocks are calculated from the offset and size of the two bounding blocks: the hole between two adjacent blocks has offset b[n].offset + b[n].size and size b[n+1].offset - (b[n].offset + b[n].size).

To find a suitable hole for node data, the current block allocator starts at the first block in the array and iterates through the blocks, looking for a hole between blocks that is large enough to hold the node's data. Once we find one, we cut the space needed for the node out of the hole, and the remainder is left as a hole for another block to possibly use later.

Note: forcing alignment to 512-byte offsets for direct I/O adds overhead, regardless of whether direct I/O is actually used.

This linear search severely degrades PerconaFT performance for very large and fragmented dictionary files. We have solid evidence from the field that this does occur: we can see it via various profiling tools as a lot of time spent in block_allocator_strategy::first_fit. It is also quite easy to reproduce by using very small node (block) sizes and small fanouts (which force the existence of more nodes, and thus more small holes). This fragmentation can and does cause all sorts of side effects, as the search operation locks the entire structure in memory, blocking nodes from translating their node/block IDs into file locations.

Let’s fix it…

In this block storage paradigm, fragmentation is inevitable. We can try to dance around it and propose different ways to prevent fragmentation (at the expense of higher CPU costs, online/offline operations, etc.). Or we can look at the way the block allocator works and try to make it more efficient. Attacking the latter is the better strategy (not to say we aren't still actively looking into the former).

Tree-based “Max Hole Size” (MHS) lookup

The linear-search block allocator has no idea where bigger and smaller holes might be located within the set (a core limitation), so it must use brute force to find a hole big enough for the data it needs to store. To address this, we implemented a new in-memory, tree-based algorithm (a red-black tree) that replaces the in-memory linear array and integrates the hole-size search into the tree structure itself.

In the new block allocator implementation, we store the set of known in-use blocks in the nodes of a binary tree instead of a linear array, ordered by the file offset of the blocks. We then add a little extra data to each node of this tree: the maximum hole size to expect in each child subtree. So when searching for a hole, we can quickly drill down the tree to an available hole of the correct size without performing a fully linear scan. The trade-off is that merging holes and updating the parental max hole sizes is slightly more intricate and time-consuming than in a linear structure, but the huge improvement in search efficiency makes this extra overhead pure noise.

In this overly simplified example, we have five blocks:

  • offset 0 : 1 byte
  • offset 3 : 2 bytes
  • offset 6 : 3 bytes
  • offset 10 : 5 bytes
  • offset 20 : 8 bytes

We can calculate four holes in between those blocks:

  • offset 1 : 2 bytes
  • offset 5 : 1 byte
  • offset 9 : 1 byte
  • offset 15 : 5 bytes

The search for a 4-byte hole traverses down the right side of the tree and discovers the hole at offset 15, which is big enough for our 4 bytes, without needing to visit the nodes at offsets 0 and 3. For the algorithmic folks out there, we have gone from an O(n) search to an O(log n) search, which is tremendously more efficient in severe fragmentation states. A side effect is that we tend to allocate blocks from holes closer to the needed size, rather than from the first one big enough to fit; the small-hole fragmentation issue may actually increase over time, but that has yet to be seen in our testing.

Benchmarks

As our CTO Vadim Tkachenko asserts, there are “Lies, Damned Lies and Benchmarks.” We won’t convince you using some pseudo-real-world benchmark that uses sleight of hand. We’re going to show a simple test case where we thought, “What is the worst possible scenario that I can come up with in a small-ish benchmark to show the differences?”.

That scenario is actually pretty simple. We shape the tree to have as many nodes as possible, and intentionally use settings that reduce concurrency. We will use a standard sysbench OLTP test, and run it for about three hours after the prepare stage has completed:

  • Hardware:
    • Intel i7, 4 core hyperthread (8 virtual cores) @ 2.8 GHz
    • 16 GB of memory
    • Samsung 850 Pro SSD
  • Sysbench OLTP:
    • 1 table of 160M rows or about 30GB of primary key data and 4GB secondary key data
    • 24 threads
    • We started each test server instance with no data. Then we ran the sysbench prepare, then the sysbench run with no shutdown in between the prepare and run.
  • prepare command: /data/percona/sysbench/sysbench/sysbench --test=/data/percona/sysbench/sysbench/tests/db/parallel_prepare.lua --mysql-table-engine=tokudb --oltp-tables-count=1 --oltp-table-size=160000000 --mysql-socket=$(PWD)/var/mysql.sock --mysql-user=root --num_threads=1 run
    • run command: /data/percona/sysbench/sysbench/sysbench --test=/data/percona/sysbench/sysbench/tests/db/oltp.lua --mysql-table-engine=tokudb --oltp-tables-count=1 --oltp-table-size=160000000 --rand-init=on --rand-type=uniform --num_threads=24 --report-interval=30 --max-requests=0 --max-time=10800 --percentile=99 --mysql-socket=$(PWD)/var/mysql.sock --mysql-user=root run
  • mysqld/TokuDB configuration
    • innodb_buffer_pool_size=5242880
    • tokudb_directio=on
    • tokudb_empty_scan=disabled
    • tokudb_commit_sync=off
    • tokudb_cache_size=8G
    • tokudb_checkpointing_period=300
    • tokudb_checkpoint_pool_threads=1
    • tokudb_enable_partial_eviction=off
    • tokudb_fsync_log_period=1000
    • tokudb_fanout=8
    • tokudb_block_size=8K
    • tokudb_read_block_size=1K
    • tokudb_row_format=tokudb_uncompressed
    • tokudb_cleaner_period=1
    • tokudb_cleaner_iterations=10000

The results were amazing: sustained throughput, immensely better response time and better utilization of available CPU resources. Of course, this is all somewhat artificial, with a tree shape that no sane user would implement. But it illustrates what happens when the linear list contains many small holes: exactly what we set out to fix!

Closing

Look for this improvement to appear in Percona Server 5.6.32-78.0 and 5.7.14-7. It’s a good one for you if you have huge TokuDB data files with lots and lots of nodes.

Credits!

Throughout this post, I referred to “we” numerous times. That “we” encompasses a great many people that have looked into this in the past and implemented the current solution. Some are current and former Percona and Tokutek employees that you may already know by name. Some are newer at Percona. I got to take their work and research, incorporate it into the current codebase, test and benchmark it, and report it here for all to see. Many thanks go out to Jun Yuan, Leif Walsh, John Esmet, Rich Prohaska, Bradley Kuszmaul, Alexey Stroganov, Laurynas Biveinis, Vlad Lesin, Christian Rober and others for all of the effort in diagnosing this issue, inventing a solution, and testing and reviewing this change to the PerconaFT library.

How Apache Spark makes your slow MySQL queries 10x faster (or more)

Latest MySQL Performance Blog posts - August 17, 2016 - 8:26am

In this blog post, we’ll discuss how to improve the performance of slow MySQL queries using Apache Spark.

Introduction

In my previous blog post, I wrote about using Apache Spark with MySQL for data analysis and showed how to transform and analyze a large volume of data (text files) with Apache Spark. Vadim also performed a benchmark comparing the performance of MySQL and Spark with the Parquet columnar format (using Air traffic performance data). That works great, but what if we don’t want to move our data from MySQL to another storage (i.e., columnar format), and instead want to run “ad hoc” queries on top of an existing MySQL server? Apache Spark can help here as well.

TL;DR version:

Using Apache Spark on top of the existing MySQL server(s) (without the need to export or even stream data to Spark or Hadoop), we can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives us an additional performance increase for some queries. You can also use the Spark cache function to cache the whole MySQL query results table.

The idea is simple: Spark can read MySQL data via JDBC and can also execute SQL queries, so we can connect it directly to MySQL and run the queries. Why is this faster? For long running (i.e., reporting or BI) queries, it can be much faster as Spark is a massively parallel system. MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries are executed inside Spark and run 5-10 times faster (on top of the same MySQL data).

In addition, Spark can add “cluster” level parallelism. In the case of MySQL replication or Percona XtraDB Cluster, Spark can split the query into a set of smaller queries (in the case of a partitioned table, one query per partition, for example) and run those in parallel across multiple slave servers or multiple Percona XtraDB Cluster nodes. Finally, it will use map/reduce-style processing to aggregate the results.

I’ve used the same “Airlines On-Time Performance” database as in previous posts. Vadim created some scripts to download data and upload it to MySQL. You can find the scripts here: https://github.com/Percona-Lab/ontime-airline-performance. I’ve also used Apache Spark 2.0, which was released July 26, 2016.

Apache Spark Setup

Starting Apache Spark in standalone mode is easy. To recap:

  1. Download Apache Spark 2.0 and unpack it somewhere.
  2. Start master
  3. Start slave (worker) and attach it to the master
  4. Start the app (in this case spark-shell or spark-sql)

Example:

root@thor:~/spark# ./sbin/start-master.sh
less ../logs/spark-root-org.apache.spark.deploy.master.Master-1-thor.out
15/08/25 11:21:21 INFO Master: Starting Spark master at spark://thor:7077
15/08/25 11:21:21 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/08/25 11:21:21 INFO MasterWebUI: Started MasterWebUI at http://10.60.23.188:8080
root@thor:~/spark# ./sbin/start-slave.sh spark://thor:7077

To connect to Spark we can use spark-shell (Scala), pyspark (Python) or spark-sql. Since spark-sql is similar to the MySQL CLI, using it would be the easiest option (even “show tables” works). I also wanted to work with Scala in interactive mode, so I’ve used spark-shell as well. In all the examples I’m using the same SQL query in MySQL and Spark, so working with Spark is not that different.

To work with MySQL server in Spark we need Connector/J for MySQL. Download the package and copy the mysql-connector-java-5.1.39-bin.jar to the spark directory, then add the class path to the conf/spark-defaults.conf:

spark.driver.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar
spark.executor.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar

Running MySQL queries via Apache Spark

For this test I used one physical server with 12 CPU cores (an older Intel(R) Xeon(R) CPU L5639 @ 2.13GHz), 48GB of RAM and SSD disks. I installed MySQL and started the Spark master and a Spark slave on the same box.

Now we are ready to run MySQL queries inside Spark. First, start the shell (from the Spark directory, /usr/local/spark in my case):

$ ./bin/spark-shell --driver-memory 4G --master spark://server1:7077

Then we will need to connect to MySQL from spark and register the temporary view:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000",
      "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
  )).load()
jdbcDF.createOrReplaceTempView("ontime")

So we have created a “datasource” for Spark (or in other words, a “link” from Spark to MySQL). The Spark table name is “ontime” (linked to the MySQL ontime.ontime_part table), and we can run SQL queries in Spark, which Spark in turn parses and translates into MySQL queries.

“partitionColumn” is very important here. It tells Spark to run multiple queries in parallel, one query per partition.
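
As a rough sketch of what that means (a simplified approximation of the partitioning logic in Spark 2.0's JDBC datasource, not its exact code), the bounds are split into one WHERE clause per partition:

// Approximates how Spark turns lowerBound/upperBound/numPartitions into
// per-partition predicates; compare with the SHOW PROCESSLIST output
// later in this post.
def partitionPredicates(col: String, lower: Long, upper: Long, n: Int): Seq[String] = {
  val stride = (upper - lower) / n max 1
  (0 until n).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$col < $hi or $col is null"  // first partition also catches NULLs
    else if (i == n - 1) s"$col >= $lo"        // last partition is open-ended
    else s"$col >= $lo AND $col < $hi"
  }
}

// partitionPredicates("yearD", 1988, 2014, 26) produces
// "yearD < 1989 or yearD is null", "yearD >= 1989 AND yearD < 1990", ..., "yearD >= 2013"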

Now we can run the query:

val sqlDF = sql("select min(year), max(year) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin = 'RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 100000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10") sqlDF.show()

MySQL Query Example

Let’s go back to MySQL for a second and look at the query example. I’ve chosen the following query (from my older blog post):

select min(year), max(year) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 100000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10

The query will find the total number of delayed flights per airline. In addition, the query will calculate a smart “ontime” rating, taking into consideration the number of flights (we do not want to compare smaller air carriers with the large ones, and we want to exclude the older airlines that are not in business anymore).

The main reason I’ve chosen this query is that it is hard to optimize in MySQL. The conditions in the “where” clause filter out only ~30% of the rows; roughly 70% of the table still matches. I’ve done a basic calculation:

mysql> select count(*) FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI');
+-----------+
| count(*)  |
+-----------+
| 108776741 |
+-----------+
mysql> select count(*) FROM ontime;
+-----------+
| count(*)  |
+-----------+
| 152657276 |
+-----------+
mysql> select round((108776741/152657276)*100, 2);
+-------------------------------------+
| round((108776741/152657276)*100, 2) |
+-------------------------------------+
|                               71.26 |
+-------------------------------------+

Table structure:

CREATE TABLE `ontime_part` (
  `YearD` int(11) NOT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `MonthD` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL,
  ...
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`,`YearD`),
  KEY `covered` (`DayOfWeek`,`OriginState`,`DestState`,`Carrier`,`YearD`,`ArrDelayMinutes`)
) ENGINE=InnoDB AUTO_INCREMENT=162668935 DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (YearD)
(PARTITION p1987 VALUES LESS THAN (1988) ENGINE = InnoDB,
 PARTITION p1988 VALUES LESS THAN (1989) ENGINE = InnoDB,
 PARTITION p1989 VALUES LESS THAN (1990) ENGINE = InnoDB,
 PARTITION p1990 VALUES LESS THAN (1991) ENGINE = InnoDB,
 PARTITION p1991 VALUES LESS THAN (1992) ENGINE = InnoDB,
 PARTITION p1992 VALUES LESS THAN (1993) ENGINE = InnoDB,
 PARTITION p1993 VALUES LESS THAN (1994) ENGINE = InnoDB,
 PARTITION p1994 VALUES LESS THAN (1995) ENGINE = InnoDB,
 PARTITION p1995 VALUES LESS THAN (1996) ENGINE = InnoDB,
 PARTITION p1996 VALUES LESS THAN (1997) ENGINE = InnoDB,
 PARTITION p1997 VALUES LESS THAN (1998) ENGINE = InnoDB,
 PARTITION p1998 VALUES LESS THAN (1999) ENGINE = InnoDB,
 PARTITION p1999 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p2000 VALUES LESS THAN (2001) ENGINE = InnoDB,
 PARTITION p2001 VALUES LESS THAN (2002) ENGINE = InnoDB,
 PARTITION p2002 VALUES LESS THAN (2003) ENGINE = InnoDB,
 PARTITION p2003 VALUES LESS THAN (2004) ENGINE = InnoDB,
 PARTITION p2004 VALUES LESS THAN (2005) ENGINE = InnoDB,
 PARTITION p2005 VALUES LESS THAN (2006) ENGINE = InnoDB,
 PARTITION p2006 VALUES LESS THAN (2007) ENGINE = InnoDB,
 PARTITION p2007 VALUES LESS THAN (2008) ENGINE = InnoDB,
 PARTITION p2008 VALUES LESS THAN (2009) ENGINE = InnoDB,
 PARTITION p2009 VALUES LESS THAN (2010) ENGINE = InnoDB,
 PARTITION p2010 VALUES LESS THAN (2011) ENGINE = InnoDB,
 PARTITION p2011 VALUES LESS THAN (2012) ENGINE = InnoDB,
 PARTITION p2012 VALUES LESS THAN (2013) ENGINE = InnoDB,
 PARTITION p2013 VALUES LESS THAN (2014) ENGINE = InnoDB,
 PARTITION p2014 VALUES LESS THAN (2015) ENGINE = InnoDB,
 PARTITION p2015 VALUES LESS THAN (2016) ENGINE = InnoDB,
 PARTITION p_new VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

Even with a “covered” index, MySQL will have to scan ~70M-100M rows and create a temporary table:

mysql> explain select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
         type: range
possible_keys: covered
          key: covered
      key_len: 2
          ref: NULL
         rows: 70483364
        Extra: Using where; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

What is the query response time in MySQL:

mysql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
+------------+----------+---------+----------+-----------------+------+
| min(yearD) | max_year | Carrier | cnt      | flights_delayed | rate |
+------------+----------+---------+----------+-----------------+------+
|       2003 |     2013 | EV      |  2962008 |          464264 | 0.16 |
|       2003 |     2013 | B6      |  1237400 |          187863 | 0.15 |
|       2006 |     2011 | XE      |  1615266 |          230977 | 0.14 |
|       2003 |     2005 | DH      |   501056 |           69833 | 0.14 |
|       2001 |     2013 | MQ      |  4518106 |          605698 | 0.13 |
|       2003 |     2013 | FL      |  1692887 |          212069 | 0.13 |
|       2004 |     2010 | OH      |  1307404 |          175258 | 0.13 |
|       2006 |     2013 | YV      |  1121025 |          143597 | 0.13 |
|       2003 |     2006 | RU      |  1007248 |          126733 | 0.13 |
|       1988 |     2013 | UA      | 10717383 |         1327196 | 0.12 |
+------------+----------+---------+----------+-----------------+------+
10 rows in set (19 min 16.58 sec)

19 minutes is definitely not great.

SQL in Spark

Now we want to run the same query inside Spark and let Spark read data from MySQL. We will create a “datasource” and execute the query:

scala> val jdbcDF = spark.read.format("jdbc").options(
     |   Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
     |   "dbtable" -> "ontime.ontime_sm",
     |   "fetchSize" -> "10000",
     |   "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2015", "numPartitions" -> "48"
     |   )).load()
16/08/02 23:24:12 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 27; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2015.
jdbcDF: org.apache.spark.sql.DataFrame = [id: int, YearD: date ... 19 more fields]
scala> jdbcDF.createOrReplaceTempView("ontime")
scala> val sqlDF = sql("select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10")
sqlDF: org.apache.spark.sql.DataFrame = [min(yearD): date, max_year: date ... 4 more fields]
scala> sqlDF.show()
+----------+--------+-------+--------+---------------+----+
|min(yearD)|max_year|Carrier|     cnt|flights_delayed|rate|
+----------+--------+-------+--------+---------------+----+
|      2003|    2013|     EV| 2962008|         464264|0.16|
|      2003|    2013|     B6| 1237400|         187863|0.15|
|      2006|    2011|     XE| 1615266|         230977|0.14|
|      2003|    2005|     DH|  501056|          69833|0.14|
|      2001|    2013|     MQ| 4518106|         605698|0.13|
|      2003|    2013|     FL| 1692887|         212069|0.13|
|      2004|    2010|     OH| 1307404|         175258|0.13|
|      2006|    2013|     YV| 1121025|         143597|0.13|
|      2003|    2006|     RU| 1007248|         126733|0.13|
|      1988|    2013|     UA|10717383|        1327196|0.12|
+----------+--------+-------+--------+---------------+----+

spark-shell does not show the query time. It can be retrieved from the Web UI or from spark-sql. I’ve re-run the same query in spark-sql:

./bin/spark-sql --driver-memory 4G --master spark://thor:7077
spark-sql> CREATE TEMPORARY VIEW ontime
         > USING org.apache.spark.sql.jdbc
         > OPTIONS (
         >   url "jdbc:mysql://localhost:3306/ontime?user=root&password=",
         >   dbtable "ontime.ontime_part",
         >   fetchSize "1000",
         >   partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "48"
         > );
16/08/04 01:44:27 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 26; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2014.
Time taken: 3.864 seconds
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
16/08/04 01:45:13 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
2003  2013  EV  2962008   464264   0.16
2003  2013  B6  1237400   187863   0.15
2006  2011  XE  1615266   230977   0.14
2003  2005  DH  501056    69833    0.14
2001  2013  MQ  4518106   605698   0.13
2003  2013  FL  1692887   212069   0.13
2004  2010  OH  1307404   175258   0.13
2006  2013  YV  1121025   143597   0.13
2003  2006  RU  1007248   126733   0.13
1988  2013  UA  10717383  1327196  0.12
Time taken: 139.628 seconds, Fetched 10 row(s)

So the response time of the same query is almost 10x faster (on the same server, just one box). But how was this query translated into MySQL queries, and why is it so much faster? Here is what is happening inside MySQL:

Inside MySQL

Spark:

scala> sqlDF.show()
[Stage 4:> (0 + 26) / 26]

MySQL:

mysql> select id, info from information_schema.processlist where info is not NULL and info not like '%information_schema%';
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id    | info                                                                                                                                                                                                                                                    |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10948 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002) |
| 10965 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2007 AND yearD < 2008) |
| 10966 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1991 AND yearD < 1992) |
| 10967 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1994 AND yearD < 1995) |
| 10968 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1998 AND yearD < 1999) |
| 10969 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2010 AND yearD < 2011) |
| 10970 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2002 AND yearD < 2003) |
| 10971 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2006 AND yearD < 2007) |
| 10972 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1990 AND yearD < 1991) |
| 10953 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2009 AND yearD < 2010) |
| 10947 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1993 AND yearD < 1994) |
| 10956 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD < 1989 or yearD is null)  |
| 10951 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2005 AND yearD < 2006) |
| 10954 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1996 AND yearD < 1997) |
| 10955 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2008 AND yearD < 2009) |
| 10961 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1999 AND yearD < 2000) |
| 10962 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2011 AND yearD < 2012) |
| 10963 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2003 AND yearD < 2004) |
| 10964 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1995 AND yearD < 1996) |
| 10957 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2004 AND yearD < 2005) |
| 10949 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1989 AND yearD < 1990) |
| 10950 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1997 AND yearD < 1998) |
| 10952 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2013)                  |
| 10958 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1992 AND yearD < 1993) |
| 10960 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2000 AND yearD < 2001) |
| 10959 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2012 AND yearD < 2013) |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
26 rows in set (0.00 sec)

Spark is running 26 queries in parallel, which is great. As the table is partitioned, each query uses only one MySQL partition, but scans that whole partition:

mysql> explain partitions SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
   partitions: p2001
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5814106
        Extra: Using where
1 row in set (0.00 sec)

In this case, as the box has 12 CPU cores / 24 threads, it efficiently executes 26 queries in parallel, and the partitioned table helps to avoid contention issues (I wish MySQL could scan partitions in parallel, but it can’t at the time of writing).

Another interesting thing is that Spark can “push down” some of the conditions to MySQL, but only those inside the “where” clause. All group by/order by/aggregations are done inside Spark. It needs to retrieve the data from MySQL to satisfy those operations, and it will not push down group by/order by/etc. to MySQL.

That also means that queries without “where” conditions (for example, “select count(*) as cnt, carrier from ontime group by carrier order by cnt desc limit 10”) will have to retrieve all data from MySQL and load it into Spark (whereas MySQL would otherwise do all the group by work internally). Running it in Spark might be slower or faster (depending on the amount of data and use of indexes), but it also requires more resources and potentially more memory dedicated to Spark. The above query is translated into 26 queries, each doing a “select carrier from ontime_part where (yearD >= N AND yearD < N+1)”.

Pushing down the whole query into MySQL 

If we want to avoid sending all data from MySQL to Spark we have the option of creating a temporary table on top of a query (similar to MySQL’s create temporary table as select …). In Scala:

val tableQuery = "(select yeard, count(*) from ontime group by yeard) tmp"
val jdbcDFtmp = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> tableQuery,
      "fetchSize" -> "10000"
  )).load()
jdbcDFtmp.createOrReplaceTempView("ontime_tmp")

In Spark SQL:

CREATE TEMPORARY VIEW ontime_tmp
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
  dbtable "(select yeard, count(*) from ontime_part group by yeard) tmp",
  fetchSize "1000"
);
select * from ontime_tmp;

Please note:

  1. We do not want to use “partitionColumn” here, otherwise we will see 26 queries like this in MySQL: “SELECT yeard, count(*) FROM (select yeard, count(*) from ontime_part group by yeard) tmp where (yearD >= N AND yearD < N+1)” (obviously not optimal)
  2. This is not a good use of Spark, more like a “hack.” The only good reason to do it is to be able to have the result of the query as a source of an additional query.

Query cache in Spark

Another option is to cache the result of the query (or even the whole table) and then use .filter in Scala for faster processing. This requires sufficient memory dedicated to Spark. The good news is we can add additional nodes to Spark and get more memory for the Spark cluster.

Spark SQL example:

CREATE TEMPORARY VIEW ontime_latest
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=",
  dbtable "ontime.ontime_part partition (p2013, p2014)",
  fetchSize "1000",
  partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "26"
);
cache table ontime_latest;
spark-sql> cache table ontime_latest;
Time taken: 465.076 seconds
spark-sql> select count(*) from ontime_latest;
5349447
Time taken: 0.526 seconds, Fetched 1 row(s)
spark-sql> select count(*), dayofweek from ontime_latest group by dayofweek;
790896  1
634664  6
795540  3
794667  5
808243  4
743282  7
782155  2
Time taken: 0.541 seconds, Fetched 7 row(s)
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_latest WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin='RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
2013  2013  MQ  9339   1734  0.19
2013  2013  B6  3302   516   0.16
2013  2013  EV  9225   1331  0.14
2013  2013  UA  1317   177   0.13
2013  2013  AA  5354   620   0.12
2013  2013  9E  5520   593   0.11
2013  2013  WN  10968  1130  0.1
2013  2013  US  5722   549   0.1
2013  2013  DL  6313   478   0.08
2013  2013  FL  2433   205   0.08
Time taken: 2.036 seconds, Fetched 10 row(s)

Here we cache partitions p2013 and p2014 in Spark. This retrieves the data from MySQL and loads it in Spark. After that all queries run on the cached data and will be much faster.

With Scala we can cache the result of a query and then use filters to only get the information we need:

val sqlDF = sql("SELECT flightdate, origin, dest, depdelayminutes, arrdelayminutes, carrier, TailNum, Cancelled, Diverted, Distance from ontime") sqlDF.cache().show() scala> sqlDF.filter("flightdate='1988-01-01'").count() res5: Long = 862

Using Spark with Percona XtraDB Cluster

As Spark can be used in cluster mode and scales with more and more nodes, reading data from a single MySQL server becomes a bottleneck. We can use MySQL replication slave servers or Percona XtraDB Cluster (PXC) nodes as a Spark datasource. To test it out, I’ve provisioned a Percona XtraDB Cluster with three nodes on AWS (m4.2xlarge Ubuntu instances) and also started Apache Spark on each node:

  1. Node1 (pxc1): Percona Server + Spark Master + Spark worker node + Spark SQL running
  2. Node2 (pxc2): Percona Server + Spark worker node
  3. Node3 (pxc3): Percona Server + Spark worker node

All the Spark worker nodes use the memory configuration option:

cat conf/spark-env.sh
export SPARK_WORKER_MEMORY=24g

Then I can start spark-sql (the Connector/J JAR file also needs to be copied to all nodes):

$ ./bin/spark-sql --driver-memory 4G --master spark://pxc1:7077

When creating a table, I still use localhost to connect to MySQL (url “jdbc:mysql://localhost:3306/ontime?user=root&password=xxx”). As the Spark worker nodes run on the same instances as the Percona XtraDB Cluster nodes, each worker uses its local connection. Running a Spark SQL query will then evenly distribute all 26 MySQL queries among the three MySQL nodes.

Alternatively, we can run the Spark cluster on a separate host and connect it to HAProxy, which in turn will load balance selects across multiple Percona XtraDB Cluster nodes.
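
That variant only changes the JDBC URL in the datasource definition. A minimal sketch, assuming a hypothetical HAProxy endpoint named haproxy-lb listening on port 3306 in front of the PXC nodes:

// "haproxy-lb" is a hypothetical load-balancer host; everything else
// mirrors the datasource definitions used earlier in this post.
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://haproxy-lb:3306/ontime?user=root&password=xxx",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000",
      "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "26"
  )).load()
jdbcDF.createOrReplaceTempView("ontime")
// Each of the 26 per-partition queries now lands on whichever node HAProxy picks.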

Query Performance Benchmark

Finally, here is the query response time test on the three AWS Percona XtraDB Cluster nodes:

Query 1: select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;

Query / Index type              MySQL Time        Spark Time (3 nodes)  Times Improvement
No covered index (partitioned)  19 min 16.58 sec  192.17 sec            6.02
Covered index (partitioned)     2 min 10.81 sec   48.38 sec             2.7


Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type              MySQL Time        Spark Time (3 nodes)  Times Improvement
No covered index (partitioned)  19 min 15.21 sec  195.058 sec           5.92
Covered index (partitioned)     1 min 10.38 sec   27.323 sec            2.58


Now, this looks really good, but it can be better. With three m4.2xlarge nodes we have 8*3 = 24 cores total (although they are shared between Spark and MySQL). We could expect a ~10x improvement, especially without a covered index.

However, on m4.2xlarge the amount of RAM did not allow MySQL to serve the workload entirely from memory, so all reads came from EBS non-provisioned IOPS, which only gave me ~120MB/sec. I’ve redone the test on a set of three dedicated servers:

  • 28 cores E5-2683 v3 @ 2.00GHz
  • 240GB of RAM
  • Samsung 850 PRO

The test ran completely from RAM:

Query 1 (from the above)

Query / Index type              MySQL Time       Spark Time (3 nodes)  Times Improvement
No covered index (partitioned)  3 min 13.94 sec  14.255 sec            13.61
Covered index (partitioned)     2 min 2.11 sec   9.035 sec             13.52


Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type              MySQL Time       Spark Time (3 nodes)  Times Improvement
No covered index (partitioned)  2 min 0.36 sec   7.055 sec             17.06
Covered index (partitioned)     1 min 6.85 sec   4.514 sec             14.81


With this many cores, and running from RAM, we actually do not have enough concurrency, as the table only has 26 partitions. So I also tried the unpartitioned table with the ID primary key, using 128 Spark partitions.

Note about partitioning

I’ve used a partitioned table (partitioned by year) in my tests to help reduce MySQL-level contention. At the same time, the “partitionColumn” option in Spark does not require that the MySQL table be partitioned. For example, if a table has a primary key, we can use this CREATE VIEW in Spark:

CREATE OR REPLACE TEMPORARY VIEW ontime
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://127.0.0.1:3306/ontime?user=root&password=",
  dbtable "ontime.ontime",
  fetchSize "1000",
  partitionColumn "id", lowerBound "1", upperBound "162668934", numPartitions "128"
);

Assuming we have enough MySQL servers (i.e., nodes or slaves), we can increase the number of partitions and improve the parallelism (as opposed to only 26 partitions when partitioning by year). Actually, the above test gives us an even better response time: 6.44 seconds for query 1.

Where Spark doesn’t work well

For faster queries (those that use indexes or can efficiently use an index) it does not make sense to use Spark. Retrieving data from MySQL and loading it into Spark is not free, and this overhead can be significant for faster queries. For example, a query like select count(*) from ontime_part where YearD = 2013 and DayOfWeek = 7 and OriginState = 'NC' and DestState = 'NC'; will only scan 1300 rows and returns instantly (0.00 seconds reported by MySQL).

An even better example is select max(id) from ontime_part. In MySQL, the query will use the index and all calculations will be done inside MySQL. Spark, on the other hand, will have to retrieve all IDs (select id from ontime_part) from MySQL and calculate the maximum itself. That took 24.267 seconds.
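
If such a query still needs to run from Spark, the “pushing down the whole query” trick shown earlier avoids the round trip. A sketch (the derived-table alias tmp is just illustrative):

// Wrap the aggregate in a derived table so MySQL computes max(id) itself,
// using its index; Spark then fetches only the one-row result.
val maxQuery = "(select max(id) as max_id from ontime_part) tmp"
val jdbcMax = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> maxQuery)).load()
jdbcMax.show()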

Conclusion

Using Apache Spark as an additional engine level on top of MySQL can help to speed up the slow reporting queries and add much-needed scalability for the long running select queries. In addition, Spark can help with query caching for frequent queries.

PS: Visual explain plan with Spark

The Spark Web GUI provides lots of ways of monitoring Spark jobs: for example, it shows each job’s progress, as well as a visual explain plan for the SQL.


