Linux OS Tuning for MySQL Database Performance

July 3, 2018

In this post, we will review the most important Linux settings to adjust when tuning a MySQL database server for performance. We’ll note how some of the Linux parameter settings used for OS tuning may vary according to system type: physical, virtual or cloud. Other posts have addressed MySQL parameters, like Alexander’s blog MySQL 5.7 Performance Tuning Immediately After Installation. That post remains highly relevant for the latest versions of MySQL, 5.7 and 8.0. Here we will focus more on the Linux operating system parameters that can affect database performance.

Server and Operating System

Here are some Linux parameters that you should check and consider modifying if you need to improve database performance.

Kernel – vm.swappiness

The value represents the tendency of the kernel to swap out memory pages. On a database server with ample amounts of RAM, we should keep this value as low as possible, because the extra I/O caused by swapping can slow down the service or even render it unresponsive. A value of 0 tells the kernel to avoid swapping for as long as it possibly can (on recent kernels this can leave the OOM killer as the only way out under memory pressure), while 1 causes the kernel to perform the minimum amount of swapping. In most cases the latter setting should be OK:

# Set the swappiness value as root
echo 1 > /proc/sys/vm/swappiness

# Alternatively, using sysctl
sysctl -w vm.swappiness=1

# Verify the change
cat /proc/sys/vm/swappiness
1

# Alternatively, using sysctl
sysctl vm.swappiness
vm.swappiness = 1

The change should also be persisted in /etc/sysctl.conf so that it survives a reboot:

vm.swappiness = 1
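To confirm that the persisted value and the running value agree, a small check along these lines can help. This is only a sketch: the helper names swappiness_in_conf and swappiness_is_persisted are made up for illustration, and it assumes the setting lives in /etc/sysctl.conf rather than in a drop-in file under /etc/sysctl.d.

```shell
#!/bin/sh
# Print the last vm.swappiness value set in a sysctl-style config file.
# The helper name and its optional file argument are hypothetical.
swappiness_in_conf() {
    conf="${1:-/etc/sysctl.conf}"
    # Strip whitespace, drop comment lines, keep the last matching assignment.
    sed -e 's/[[:space:]]//g' -e '/^[#;]/d' "$conf" 2>/dev/null |
        awk -F= '$1 == "vm.swappiness" { value = $2 } END { if (value != "") print value }'
}

# Succeed when the persisted value matches the value the kernel is using.
# The second argument overrides the running value (useful for dry runs).
swappiness_is_persisted() {
    running="${2:-$(cat /proc/sys/vm/swappiness)}"
    persisted="$(swappiness_in_conf "$1")"
    [ -n "$persisted" ] && [ "$persisted" = "$running" ]
}
```

Calling swappiness_is_persisted with no arguments compares /etc/sysctl.conf against /proc/sys/vm/swappiness and returns a non-zero status when they differ.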

Filesystems – XFS/ext4/ZFS

XFS

XFS is a high-performance journaling file system designed for high scalability. It provides near-native I/O performance even when the file system spans multiple storage devices. XFS has features that make it suitable for very large file systems, supporting files up to 8EiB in size. These include fast recovery, fast transactions, delayed allocation for reduced fragmentation, and near-raw I/O performance with direct I/O.

The default options for mkfs.xfs are good for optimal speed, so the simple command:

# Use default mkfs options
mkfs.xfs /dev/target_volume

will provide the best performance while ensuring data safety. Regarding mount options, the defaults should fit most cases. On some filesystems you can see a performance increase by adding the noatime mount option to /etc/fstab. For XFS filesystems the default atime behavior is relatime, which has almost no overhead compared to noatime and still maintains sane atime values.

If you create an XFS file system on a LUN that has a battery-backed, non-volatile cache, you can further increase the performance of the filesystem by disabling the write barrier with the mount option nobarrier. This helps you to avoid flushing data more often than necessary. If a BBU (battery backup unit) is not present, however, or you are unsure about it, leave barriers on; otherwise, you may jeopardize data consistency. Note that newer kernels may ignore nobarrier on XFS: the mount still succeeds, but a message is logged to dmesg. With this option on, an /etc/fstab file should look like the one below:

/dev/sda2              /datastore              xfs     defaults,nobarrier
/dev/sdb2              /binlog                 xfs     defaults,nobarrier

ext4

ext4 was developed as the successor to ext3, with added performance improvements. It is a solid option that will fit most workloads. We should note here that it supports files up to 16TB in size, a smaller limit than XFS. This is something you should consider if extreme tablespace size or growth is a requirement. Regarding mount options, the same considerations apply. We recommend the defaults for a robust filesystem without risks to data consistency. However, if an enterprise storage controller with a BBU cache is present, the following mount options will provide the best performance:

/dev/sda2              /datastore              ext4     noatime,data=writeback,barrier=0,nobh,errors=remount-ro
/dev/sdb2              /binlog                 ext4     noatime,data=writeback,barrier=0,nobh,errors=remount-ro

Note: The data=writeback option results in only metadata being journaled, not actual file data. This has the risk of corrupting recently modified files in the event of a sudden power loss, a risk which is minimized with the presence of a BBU enabled controller. nobh only works with the data=writeback option enabled.
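Because a mount can succeed even when the kernel ignores an option, it is worth checking which options are actually in effect after mounting. The following is a minimal sketch that reads /proc/mounts; the helper names mount_options and has_mount_option are illustrative only, and the kernel may report an option under a different spelling than the one used in /etc/fstab.

```shell
#!/bin/sh
# Print the mount options in effect for a mount point.
# Reads /proc/mounts by default; the optional second argument exists only so
# the helper can be pointed at a saved copy. Both helpers are hypothetical.
mount_options() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { if (opts != "") print opts }' "${2:-/proc/mounts}"
}

# Succeed if the given option (for example noatime) is active on a mount point.
# Usage: has_mount_option <mount point> <option> [mounts file]
has_mount_option() {
    mount_options "$1" "$3" | tr ',' '\n' | grep -qxF -- "$2"
}
```

For example, has_mount_option /datastore noatime returns success only if noatime appears in the option list the kernel reports for /datastore.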

ZFS

ZFS is an enterprise storage solution that combines a filesystem and a logical volume manager, with extended protection against data corruption. There are certainly cases where the rich feature set of ZFS makes it an essential option to consider, most notably when advanced volume management is a requirement. ZFS tuning for MySQL can be a complex topic and falls outside the scope of this blog. For further reference, there is a dedicated blog post on the subject by Yves Trudeau.

Disk Subsystem – I/O scheduler

Most modern Linux distributions come with the noop or deadline I/O scheduler by default, both of which provide better performance than the cfq and anticipatory ones. However, it is always good practice to check the scheduler for each device. If the value shown is anything other than noop or deadline, you can change the policy without rebooting the server:

# View the I/O scheduler setting. The value in square brackets shows the running scheduler
cat /sys/block/sdb/queue/scheduler 
noop deadline [cfq]

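# Change the setting. The write itself must run as root, so pipe through
# "sudo tee" ("sudo echo ... > file" performs the redirection as the calling user)
echo noop | sudo tee /sys/block/sdb/queue/scheduler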

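To make the change persistent, you must modify the GRUB configuration file (typically /etc/default/grub) and then regenerate the GRUB configuration with your distribution's tool, such as update-grub or grub2-mkconfig: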

# Change the line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

# to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=noop"

AWS Note: There are cases where the I/O scheduler has a value of none, most notably in AWS VM instance types where EBS volumes are exposed as NVMe block devices. Modern PCIe/NVMe devices have a very large internal queue and bypass the I/O scheduler altogether, so the setting has no use for them. In this case none is the optimal value and should be left as it is.
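Checking every device by hand gets tedious on servers with many volumes. The sketch below reports the active scheduler for each block device and also copes with devices that report a bare none. The function names and the optional sysfs root argument are assumptions made for illustration.

```shell
#!/bin/sh
# Extract the active scheduler (the entry in square brackets) from the
# contents of a queue/scheduler file, e.g. "noop deadline [cfq]" gives "cfq".
# Some devices report a bare value such as "none" without brackets.
active_scheduler() {
    case "$1" in
        *\[*\]*) printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p' ;;
        *) printf '%s\n' "$1" ;;
    esac
}

# Report the active scheduler for every block device under a sysfs root.
# The root argument is hypothetical and defaults to /sys/block.
list_schedulers() {
    root="${1:-/sys/block}"
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/}
        dev=${dev%%/*}
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Running list_schedulers with no arguments prints one "device: scheduler" line per block device, which makes it easy to spot any volume still using cfq.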

Disk Subsystem – Volume optimization

Ideally, separate disk volumes should be used for the OS installation, binlog, data, and the redo log, if this is possible. Separating the OS and data partitions physically, not just logically, will improve database performance. The RAID level can also have an impact: RAID-5 should be avoided, as the parity calculation needed to ensure integrity is costly on writes. The best performance without compromising redundancy is achieved with an advanced controller that has a battery-backed cache unit and, preferably, RAID-10 volumes spanned across multiple disks.

AWS Note: For further information about EBS volumes and AWS storage optimization, Amazon has documentation at the following links:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-optimized-instances.html

Database settings

System Architecture – NUMA settings

Non-uniform memory access (NUMA) is a memory design in which each processor of an SMP system can access its own local memory faster than non-local memory (memory that is local to other CPUs). This may result in suboptimal database performance and potentially swapping. When the buffer pool memory allocation is larger than the amount of RAM local to the node, and the default memory allocation policy is selected, swapping occurs. A NUMA-enabled server will report different node distances between CPU nodes. A uniform (non-NUMA) one will report a single distance:

# NUMA system
numactl --hardware

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65525 MB
node 0 free: 296 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 65536 MB
node 1 free: 9538 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 65536 MB
node 2 free: 12701 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 65535 MB
node 3 free: 7166 MB
node distances:

node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

# Uniform (non-NUMA) system
numactl --hardware

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64509 MB
node 0 free: 4870 MB
node distances:

node   0
  0:  10
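The same distinction can be made in a script by reading the "available: N nodes" line of the numactl output. This is a rough sketch; the helper names numa_node_count and is_numa are invented here, and it assumes the output format shown above.

```shell
#!/bin/sh
# Read `numactl --hardware` output on stdin and print the number of nodes,
# taken from the "available: N nodes (...)" line.
numa_node_count() {
    awk '$1 == "available:" { print $2; exit }'
}

# Succeed when the output on stdin describes more than one NUMA node.
is_numa() {
    count="$(numa_node_count)"
    [ -n "$count" ] && [ "$count" -gt 1 ]
}
```

A typical use would be: numactl --hardware | is_numa && echo "NUMA system detected".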

In the case of a NUMA system, where numactl shows different distances across nodes, the MySQL variable innodb_numa_interleave should be enabled to ensure memory interleaving. Percona Server provides improved NUMA support by introducing the flush_caches variable. When enabled, it will help with allocation fairness across nodes. To determine whether or not allocation is equal across nodes, you can examine numa_maps for the mysqld process with this script:

# The perl script numa_maps.pl will report memory allocation per CPU node:
# 3595 is the pid of the mysqld process
perl numa_maps.pl < /proc/3595/numa_maps

N0        :     16010293 ( 61.07 GB)
N1        :     10465257 ( 39.92 GB)
N2        :     13036896 ( 49.73 GB)
N3        :     14508505 ( 55.35 GB)
active    :          438 (  0.00 GB)
anon      :     54018275 (206.06 GB)
dirty     :     54018275 (206.06 GB)
kernelpagesize_kB:         4680 (  0.02 GB)
mapmax    :          787 (  0.00 GB)
mapped    :         2731 (  0.01 GB)
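If the Perl script is not at hand, a per-node summary can be approximated with awk. The sketch below is not the numa_maps.pl script used above: it only totals the N<node>=<pages> fields, and it assumes each line of numa_maps carries a kernelpagesize_kB field (falling back to 4 kB pages when it does not). The function name numa_summary is hypothetical.

```shell
#!/bin/sh
# Summarise per-node memory from a numa_maps file read on stdin.
# Each line lists N<node>=<pages> fields; pages are converted to GB using
# the line's kernelpagesize_kB field, falling back to 4 kB when it is absent.
numa_summary() {
    awk '
    {
        page_kb = 4
        for (i = 1; i <= NF; i++)
            if ($i ~ /^kernelpagesize_kB=/) { split($i, kv, "="); page_kb = kv[2] }
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=/) { split($i, kv, "="); kb[kv[1]] += kv[2] * page_kb }
    }
    END {
        for (node in kb)
            printf "%s %.2f GB\n", node, kb[node] / 1024 / 1024
    }' | sort
}
```

With the pid from the example above, the call would be numa_summary < /proc/3595/numa_maps. On distributions that ship it, numastat -p with the mysqld pid gives a similar per-node breakdown without any scripting.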

5 Comments

Alexey Kopytov

8 years ago

On modern distributions one can also use the numastat utility instead of a custom Perl script to get a summary of memory distribution across NUMA nodes (both system-wide and per-process).

Mark Callaghan

8 years ago

xfs might ignore nobarrier in recent kernels, mount still succeeds but dmesg shows the error. See http://smalldatum.blogspot.com/2018/01/xfs-nobarrier-and-413-linux-kernel.html

SuperQ

8 years ago

One thing I’ve done to improve memory access issues on bare metal nodes was to use hugepage support[0].

* By pre-allocating hugepages, we can be sure that we have good NUMA balancing on multi-socket hardware for the innodb memory. At the same time allow local memory allocation for various MySQL threads, rather than using forcible interleaving which may not be optimal based on the thread execution location.
* Hugepages, by their nature, have much lower access overhead due to elimination of the TLB lookup path. Since MySQL manages memory in innodb, this has a nice performance boost for memory access. (lower latency, reduced CPU use)

The only really annoying part is the setup, as nobody seems to know the exact math to determine exactly how much huge page memory you need, for what flag settings in MySQL.

https://dev.mysql.com/doc/refman/8.0/en/large-page-support.html

SuperQ

8 years ago

Not performance related, but safety for production databases, especially on dedicated servers.

Linux allows you to adjust the OOM killer based on memory use. Since MySQL is likely to be taking up a lot of memory intentionally, the normal OOM behavior is especially bad as it rates the largest process as the best thing to kill.

You can set the oom killer with echo NUM > /proc//oom_score_adj.

The values range from -1000 to 1000, where negative numbers are a “discount” based on the size, so -1000 would be NEVER kill, and 1000 would be ALWAYS kill.

I would probably avoid -1000, as a runaway MySQL could deadlock the system. But -950 would discount 95% of the memory against the badness score. This would allow the kernel to kill a runaway other job first.
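As a rough sketch of the above, the helper below validates the adjustment before writing it. The function name set_oom_score_adj and its optional proc root argument are made up for illustration. Lowering the score of a process normally requires root, and the value is lost when mysqld restarts.

```shell
#!/bin/sh
# Write an OOM score adjustment for a process.
# Usage: set_oom_score_adj <pid> <adjustment> [proc root]
# The third argument is hypothetical; it defaults to /proc.
set_oom_score_adj() {
    pid="$1"; adj="$2"; root="${3:-/proc}"
    # Accept only integers, optionally with a single leading minus sign.
    case "$adj" in
        ''|-|*[!0-9-]*|?*-*) echo "invalid adjustment: $adj" >&2; return 1 ;;
    esac
    # The kernel accepts values from -1000 to 1000.
    if [ "$adj" -lt -1000 ] || [ "$adj" -gt 1000 ]; then
        echo "adjustment out of range: $adj" >&2
        return 1
    fi
    printf '%s\n' "$adj" > "$root/$pid/oom_score_adj"
}
```

For example, set_oom_score_adj "$(pidof mysqld)" -950 applies the discount suggested here. On systemd-based systems, the OOMScoreAdjust= directive in the service unit is one way to make the setting survive restarts.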