October 22, 2014

Reference architecture for a write-intensive MySQL deployment

We designed Percona Cloud Tools (both hardware and software setup) to handle a very high-intensive MySQL write workload. For example, we already observe inserts of 1bln+ datapoints per day. So I wanted to share what kind of hardware we use to achieve this result.

Let me describe what we use, and later I will explain why.

Server:

  • Chassis: Supermicro SC825TQ-R740LPB 2U Rackmount Chassis
  • Motherboard: Supermicro X9DRI-F dual socket
  • CPU: Dual Intel Xeon Ivy Bridge E5-2643v2 (6x 3.5Ghz cores, 12x HT cores, 25M L3)
  • Memory: 256GB (16x 16GB 256-bit quad-channel) ECC registered DDR3-1600
  • Raid: LSI MegaRAID 9260-4i 4-port 6G/s hardware RAID controller, 512M buffer
  • MainStorage: PCIe SSD HGST FlashMAX II 4.8TB
  • Secondary Storage (OS, logs): RAID 1 over 2x 3TB hard drives

Software:

When selecting hardware for your application, you need to look at many aspects – typically you’re looking for a solution for which you already have experience in working with and has also proved to be the most efficient option. For us it has been as follows:

Cloud vs Bare Metal
We have experience having hardware hosted at the data center as well as cash for upfront investments in hardware so we decided to go for physical self-hosted hardware instead of the cloud. Going this route also gave us maximum flexibility in choosing a hardware setup that was the most optimal for our application rather than selecting one of the stock options.

Scale Up vs Scale Out
We have designed a system from scratch to be able to utilize multiple servers through sharding – so our main concern is choosing the most optimal configuration for the server and provisioning servers as needed. In addition to raw performance we also need to consider power usage and overhead of managing many servers which typically makes having slightly more high-end hardware worth it.

Resource Usage
Every application uses resources in different ways so an optimal configuration will be different depending on your application. Yet all applications use the same resources you need to consider. Typically you want to plan for all of your resources to be substantially used – providing some margin for spikes and maintenance.

CPU

  • Our application processes a lot of data and uses the TokuDB storage engine which uses a lot of CPU for compression, so we needed powerful CPUs.
  • Many MySQL functions are not parallel, think executing single query or Alter table so we’re going for CPU with faster cores rather than larger amount of cores. The resulting configuration with 2 sockets giving 12 cores and 24 threads is good enough for our workloads.
  • Lower end CPUs such as Xeon E3 have very attractive price/performance but only support 32GB of memory which was not enough for our application.

Memory

  • For database boxes memory is mainly used as a cache, so depending on your application you may be better off investing in memory or storage for optimal performance. Check out this blog post for more details.
  • Accessing data in memory is much faster than even on the fastest flash storage so it is still important.
    For our workload having recent data in memory is very important so we get as much “cheap” memory as we can populating all 16 slots with 16GB dimms which have attractive cost per GB at this point.

Storage
There are multiple uses for the storage so there are many variables to consider

  • Bandwidth
    • We need to be able access data on the storage device quickly and with stable response time. HGST FlashMax II has been able to meet these very demanding needs.
  • Endurance
    • When using flash storage you need to worry about endurance – how much beating with writes flash storage can handle before it wears out. Some low cost MLC SSDs would wear out in the time frame of weeks if being written with maximum speed. HGST FlashMax II has endurance rating of 10 Petabytes written (for a random workload) – 30 Petabytes written (for a sequential workload)
    • We also use TokuDB storage engine which significantly reduces amount of writes compared to Innodb.
  • Durability
    • Does the storage provide true durability with data guaranteed to be persisted when write is acknowledged at the operating system level when power goes down or is loss possible?
      We do not want to risk database corruption in case of power failure so we were looking for storage solution which guarantees durability.
      HGST FlashMax II guarantees durability which has been confirmed by our stress tests.
  • Size
    • To scale application storage demands you need to scale both number of IO operations storage can handle and storage size. For flash storage it is often the size which becomes limiting factor.
      HGST FlashMax II 4.8 TB capacity is best available on the market which allows us to go “All Flash” and achieve very quick data access to all our data set.
  • Secondary Storage
    • Not every application need requires flash storage properties.
    • We have secondary storage with conventional drives for operating system and logs.
      Sequential read/write pattern works well with low cost conventional drives and also allow us to increase flash life time, having it handling less writes.
    • We’re using RAID with BBU for secondary storage to be able to have fully durable binary logs without paying high performance penalty.

Why PCIe SSD over SATA SSD?
There are arguments that SATA SSD provides just a good enough performance for MySQL and there is no need for PCIe. While these arguments are valid in one dimension, there are several more to consider.

First, like I said PCIe SSD still provides a best absolute response time and it is an important factor for an end user experience in SaaS systems like Percona Cloud Tools.
Second, consider maintenance operations like backup, ALTER TABLES or slave setups. While these operations are boring and do not get as much attention as a response time or throughput in benchmarks, it is still operations that DBAs performs basically daily, and it is very important to finish a backup or ALTER TABLE in a predictable time, especially on 3-4TB datasize range. And this is where PCIe SSD performs much better than SATA SSDs. For SATA SSD, especially bigger size, write endurance is another point of concern.

Why TokuDB engine?
The TokuDB engine is the best when it comes to insert operations to a huge dataset, and few more factors makes it a no-brainer:

  • TokuDB compression is a huge win. I estimate into this storage ( FlashMAX II 4.8TB) we will fit about 20-30TB of raw data.
  • TokuDB is SSD friendly, as it performs much less data writes per INSERT operation than InnoDB, which greatly extends SSD (which is, well, expensive to say the least) lifetime.
About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Clould Tools, the Percona Server, Percona XraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. The TokuDB user community needs more blog posts on best practices — workloads, tuning, schema, etc. I look forward to the next post.

  2. I am curious about a couple of other issues. First, what file system is being used and does it matter? Second, what is the replication configuration of this database?

  3. Rolf says:

    Great writeup. As Mark mentioned, would love to hear more about the TokuDB use. Why the move to LVM for backups? Is the write speed tradeoff negated by SSD? Has the copy on write been factored into the lifespan of the SSDs?

  4. Hi Vadim,

    How do you handle the networking issues? How do you handle the traffic in and out of the server for 1 billion records/day (I’m assuming that a datapoint is the same as a record in the table – also, could you please give some insight of your tables?). I think the network architecture (and the elements in that architecture) is very important.

    How do you handle SPOF issues? Do you have a backup server for that server?

  5. Nils says:

    Why choose LSI for the controller? I only remember having had bad experience with the command line tool on those.

  6. To answer on all questions:

    – We use MySQL replication (master-slave pairs), and backup is taken from Slaves.
    – Filesystem is ext4. For me xfs and ext4 are almost equal to use under MySQL, but recently I saw few stability and performance issues with xfs in CentOS 6, which makes ext4 little more preferable.

    – as for LVM backup for TokuDB, I explained it there http://www.mysqlperformanceblog.com/2014/07/15/tokudb-tips-mysql-backups/ . FlashMAX II gives 10-30PB lifetime, so amount of writes is a less worry, especially given TokuDB is SSD friendly.

    – We use LSI cards a lot, and we do not have much problems with these. yes, command line tool is a bit weird, but it is fair to say we are happy with LSI hardware quality and performance.

  7. ives says:

    vadim,

    any insight into partitioning & layout optimization for read/write patterns?

    – transaction logs
    – master binary logs
    – replication binary logs
    – tablespaces
    – temp tables (number of volumes, type of storage)

Speak Your Mind

*