November 21, 2014

Heikki Tuuri Innodb answers – Part I

Its almost a month since I promised Heikki Tuuri to answer Innodb Questions. Heikki is a busy man so I got answers to only some of the questions but as people still poking me about this I decided to publish the answers I have so far. Plus we may get some interesting follow up questions out of this.

I had added my comments to some of the questions. HT will stand for Heikki Tuuri in the answers and PZ for myself.

Q1: Why did you decide to go with gzip compression instead of implementing “classical” approach of index prefix compression ?

HT: The reason is that index prefix compression prevents one from using a binary search on a B-tree node. We could not compare a search key K to an index record R if we only knew the suffix of R. There are, of course, hybrid approaches where some prefix information is stored, and a search almost as efficient as a binary search can be performed.

But InnoDB’s adaptive hash indexes require that the prefix is stored in every record R. An adaptive hash index tries to guess where our search would end in the B-tree. To verify that we arrived at the right position, we need to know the whole record R.

Also, a gzip compression potentially compresses the database more than just prefix compression. The downside in gzip is increased usage of CPU time.

PZ: We will see how it plays when we can see more benchmarks. The big problems I see so far with this feature are ease of use and requirement to “guess” how your data will compress for the whole table. Among other things compression for data pages and uncompressed index pages can be significantly different.

Q2: Does Innodb has pointers to the next/previous page in index tree leaf pages ?

HT: Yes

PZ: It is great as it can help to make full table scans and index scans more efficient. I can’t wait for ability to create physically sorted indexes with Innodb (with index built by sort) this is when this should be helpful the most.

Q3: Does Innodb secondary Indexes trees are allocated in two extents same as primary key tree or stored in the same extent ?

HT: An InnoDB extent is a contiguous block of 64 pages, 16 kB each. Each B-tree index in InnoDB occupies two SEGMENTs. One segment is for the non-leaf nodes, and the other for leaf nodes. The first 32 pages of a segment are allocated from ‘fragment extents’ where any individual page can be allocated to whatever use in whatever segment. After that, a segment always reserves whole 64 page extents.

Thus, the answer to the question is: for small tables, everything can reside within a single extent. For big tables, a secondary index reserves different extents from the clustered index.

Q4: Does Innodb ever merges sequential pages together if they become almost empty

HT: Yes. If a page becomes less than half full, a merge is attempted.

btr0cur.h:

Q5: When Innodb free space becomes available for use within same object? When does Innodb make it available for other objects.

HT: Free space can always be recycled within a segment. A segment can release free space to other segments, if a 64 page extent becomes totally empty, and also it can release individual pages from its 32 ‘fragment extent’ pages.

PZ: This is the reason why when you delete a lot of scattered rows you may not see a “free space” in Innodb tablespace to grow significantly.

Q6: Does Innodb policy replacement algorithm takes into account page position in btree index ?

HT: No. It is a pure LRU algorithm.

PZ: This is one of the things which we would like to play with – I would expect serious improvements are possible especially on high IO pressure workloads.

Q7: Does Innodb has any protection from pages being overwritten in buffer pool by large full table scan

HT: No

PZ: Another possible area of optimization. I frequently see batch jobs killing server performance overtaking buffer pool. Though full table scan is only one of replacement policy optimizations possible.

Q8: How do you assess current state of Innodb scalability with multiple threads and multiple concurrent transactions ?

HT: The scalability in my opinion is ‘mediocre’, and we are working on improving it.

PZ: Thank you for honest acknowledgment and I can’t wait to get hold of further improved versions.

Q9: Do you favour database “self tuning” or prefer to expose tuning options to be available to the user.

HT: I favor self tuning.

PZ: I also like self tuning in theory. But I really like to have tools to steer the database behavior in the cases when self tuning gets it wrong.

Q10: When Innodb decides to schedule sequential read-ahead, random read ahead ?

HT: InnoDB schedules a sequential read-ahead when it notices an ascending access pattern that passes an extent boundary. Roughly, if most of the pages in the extent have been accessed, and in an ascending order of the
file address, then InnoDB schedules the read of all the 64 pages of the extent where the next index page in the ascending order is. InnoDB has a similar heuristic for descending index scans.

The InnoDB sequential read-ahead is not optimal at all. It should schedule the read of the next extent when we are in the middle of the processing of the previous extent.

Random read-ahead means that when InnoDB notices that if at least 13 pages in an extent have been accessed very recently, then InnoDB reads in all the rest of the pages in the extent.

PZ: Interesting enough, there was some work done by DIKU students improving algorithms used for read ahead and results were significant. They however were not merged in Innodb mainline yet.

Q11: If read-ahead is happening and Innodb needs to access one of the pages being fetched does it has to wait for whole read ahead request to complete

HT: Unfortunately, it often has to wait. A read-ahead of a 64 page extent is
often done as a single 1 MB file read.

PZ: “often” a bit stikes me here. I thought it would be always single read, typically 1MB in size. I should also mention a post on this topic once.

Q12: If read-ahead request was placed to the queue but did not start executing yet will innodb has to wait for it to complete or can just go ahead and read page it needs

HT: InnoDB has to wait.

PZ: This one is a bummer. I thought Innodb will make sure to schedule synchronous wait in front if read-ahead IO has not been started already. The interesting question it brings – how many read-ahead requests can be queued at once.

Q13: Is there any IO concurrency control – how many random and sequential read aheads can be happening at the same time

HT: All read-aheads are performed in the Unix version with a single thread. Only one read-ahead can be happening on Unix at a time. On Windows, InnoDB uses native async I/O, and can perform many read-aheads at the same time.

PZ: Another thing I’d like to experiment with. I would expect you can gain significantly with multiple IO threads on systems with large amount of hard drives. Google has Patches which let you to try it out.

Q14: Are there any plans to fix Innodb so it unlocks (or never locks) the rows if they were not matched by query where clause when it was not resolved by index

HT: That is already implemented in 5.1: if you set the transaction isolation level READ COMMITTED, then InnoDB normally does not lock the ‘gaps’, and it does not lock rows whose latest committed version does not match the
WHERE condition in a search. Please use this with care: you MUST use row-based binlogging and replication, and remember that the execution is not serializable.

PZ: Cool. This is one of little known new features in MySQL 5.1

Q15: How frequently does Innodb fuzzy checkpointing is activated

HT: InnoDB flushes about 128 dirty pages per flush. That means that under a heavy write load, a new flush and a checkpoint happens more than once per second.

PZ: Not what this answers question exactly. So is it activated as soon as 128 pages must be flushed ? If so how this is discovered when. I think fuzzy checkpointing one of little known aspects of Innodb operations while I see it causing problems every so often.

Q16: How Innodb decided how many pages to flush at each checkpoint interval

HT: Usually it is 128, or less.

PZ: Looks like another magic number to experiment with.

Q17: How InnoBD handles blobs/text fields (needs more info)
According to the documentation, InnoDB put first 768 chars of each text/blob in the page and will allocate some space outside of the page.

However:
1. if the total size of the row is less than 8100 bytes InnoDB will not allocate additional space for blobs, even if each blob is larger than 768.
2. InnoDB will allocate additional space outside of the page for _each_ blob (if we have 8 blobs 8xN bytes will be allocated)

Question:
How much space InnoDB allocates for each blob outside of the page?

HT: For each column that InnoDB needs to store ‘externally’, it allocates
whole 16 kB pages. That will cause a lot of waste of space if the fields
are less that 16 kB.

The ‘zip’ source code tree by Marko has removed most of the 768 byte
local storage in the record. In that source code tree, InnoDB only needs
to store locally a prefix of an indexed column.

PZ: I think it is also very interesting question what happens for blobs larger than 16K – is exact size allocated or also segment based allocation is used.

Q18: Innodb Q: Is Group commit still busted in 5.0/5.1?

HT: Yes. However, work has been done on it by Inaam.

Q19: INNODB uses it’s own internal buffer pool but it’s in conflict with the Linux kernel’s buffer pool. In high load situations where INNODB is using most of the system’s memory the kernel can decide (incorrectly) to swap out MySQL. There are two solutions here. Memlock and O_DIRECT.

Memlock is apparently unstable on Linux. O_DIRECT is an alternative but Linus hates it. What are your suggestions here?

HT: According to tests by Peter Zaitsev, O_DIRECT works well to remove double buffering. In distant future, I would like to have the memlock also.

PZ: memlock already works – it is global MySQL Server option, though it may have problems on Linux if you lock very large portion of physical memory. There is also third alternative – “huge pages” which can be used for buffer pool and which are not swappable. They also allow to reduce number of TLB cache misses. huge pages along however do not stop IO pressure and OS may swap out not buffer pool but some other important parts of MySQL process.

Q20: We’ve migrated to using INNODB so that it operates 100% out of memory. This way every write is serial so that we can see the full write speed of the disk for throughput.

INNODB uses fuzzy checkpointing to commit data to disk. The problem that we’ve seen in high IO scenearios is that it commits too often and we’re only seeing 33% of the raw disk write speed. Since the commit isn’t perfectly serial it’s having to seek on disk which slows down throughput.

If we could control the checkpointing rate we could use an 8G write ahead log and tell innodb to do a full write of the database (basically dump memory to disk) once every minute or two.

This way we’d see a 3x performance boost and it would write at 100MBps vs 33MBps.

Any plans to enable tuning of the checkpointing rate? Postgres exposes this data and allows the user to tune the checkpointing values.

HT: Hmm… we could tune the way InnoDB does the buffer pool flush. I think Yasufumi Kinoshita talked at Users’ Conference 2007 about his patch that makes InnoDB’s flushes smoother and increase performance substantially.

I assume there is lots of room to tune the flushes, since I never optimized the algorithm under a realistic workload.

Making the doublewrite buffer bigger than 128 pages would require a bit more work. Now it is allocated permanently in the system tablespace when an InnoDB instance is created.

PZ: Sequential “dump” of buffer pool would need more optimizations to work – Innodb would also need also to write not-modified pages if this will get sequential write. Say if we have 1MB segment with all pages in buffer pool it well may be faster to do 1 write than flush 10 non sequential pages from it which were modified.

Q21: What’s the status of INNODB in 5.1.x?
Specifically: What’s the current status of innodb_thread_concurrency ? Is it suggested to set this value to a LARGE value (somewhat like 200) for additional throughput?

HT: ha_innodb.cc in 5.1:

The default is 8, which will work well in most cases. If you are not seeing ‘thread thrashing’ (lots of threads waiting for semaphores in SHOW INNODB STATUS), you can try to disable thread throttling completely
by setting the value 0.

But if you see thread thrashing, then a value 1 or 2 often solves your thrashing problem.

PZ: Values 1,2 may solve your thrashing problem but also will limit innodb to be able to use only couple of CPUs efficiently. Thought even that is often better than bad thrashing.

Q22: Also, What’s the deal with group commit being broken in 5.1.x? Is it possible to get the same benefit if you’re performing INSERT with multiple values? We build up INSERTS on our clients and insert values 50 or so at a time.

HT: For most users, group commit being broken in 5.1 makes no difference. If you have a battery-backed disk controller cache, then the commit returns in less than 100 microseconds. And if you do not have, then the only way to get good performance is to set innodb_flush_log_at_trx_commit=2, that is, to flush the log to the disk only once per second.

PZ: We’ve helped number of users who had serve regressions in 5.0 because of group commit being broken. Most do not have RAID with BBU. With good RAID with BBU you can get 2000 fsync/sec or more which is typically enough.

Q23: Rumor has it that if you disable the new XA support (which we don’t use) then you’ll get group commit again.

HT: I do not think so. Sergei Golubchik serialized the MySQL binlog write and the InnoDB log flush with a mutex in 5.1, to implement XA, and switching off the XA will not remove that serialization. MySQL’s binlog and InnoDB’s log must have the transactions in the same order, for a recovery based on MySQL’s binlog to work.

Of course, if you remove that mutex from 5.1, then InnoDB’s group commit works again.

PZ: You can disable binary logging to get Group Commit back. Though this is often not the option. Here are some benchmarks we did.

Q24: INNODB has typically had problems scaling on multicore boxes. The new quad core CPUs from AMD and Intel means that 8-core boxes are going to be common place. We’re considering buying 40 8-core boxes with 32G of memory. Have any specific thoughts here? INNODB was originally written on single core CPUs.

HT: Users have reported lots of InnoDB scalability problems from multicore CPU’s. The most recent 5.0.xx behave better. We will continue the scaling improvements. Latest patches of Yasufumi Kinoshita attained very nice scaling up to 8 or 16 cores. These patches are not in the official InnoDB, though.

PZ: If your load is CPU bound and you’re scaling out you can get better performance by using couple of MySQL Servers on single node. You can make them to use different hard drives and bind to different CPUs.

Q25: In InnoDB, the referenced columns of a foreign key constraint need not form a primary key or a unique constraint: it is sufficient if the referenced columns form a consecutive set of leftmost columns of any index. Can you describe a real-world use case where it would be useful to have a foreign key not reference an entire primary key or unique constraint (something that is mandatory in every other RDBMS that suppors foreign key constraints)? Any concrete examples are appreciated.

HT: Yes, it is sufficient that a foreign key and the referenced key appear as the FIRST (leftmost) columns in an index, in the same order in the foreign key and the referenced key (this same order condition could be relaxed, though). The index is required for fast lookups: it can be used if the columns are the first columns in an index.

I think it is sound design practice to make the foreign key to reference an entire primary key, so that you can easily port your applications to other database brands. Maybe in some denormalized database (for example,
if ORDER and ORDERLINE tables are joined to form a single table), it might make sense for a foreign key to reference only a prefix of the primary key.

PZ: Thanks Heikki, Remember we’re waiting you to answer second portion of the questions.

UPDATE: Second Part of the answers is now available

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Odin says:

    Cool post, thanks Heikki.

    I have a 8 core system, so these patches mentioned (from Yasufumi Kinoshita); are they publically available?

  2. Thanks Peter for continuing to document the little known aspects of InnoDB. InnoDB is great in many ways, but much of this can only be found out by reading the code or asking InnoDB developers. I recommend reading the code. It is clean and well-documented.

    From what I read in the code, dirty pages are not flushed in strict LSN order during fuzzy checkpoints. When a dirty page is found to flush, InnoDB checks whether any pages for nearby disk blocks are also dirty and schedules those to be written as well. This sould decrease the random IO done by checkpoints.

  3. peter says:

    Mark,

    Indeed we frequently look into Innodb code ourselves and it is very well commented.

    You’re correct when Innodb does a flush it always tries to check if there are dirty pages n+1 or n-1 blocks and flush them as well because it is number of IOs what really counts.

    However it would not reorder IOs or keep into account situation when n and n+3 for example have relatively close LSN and n+3 would also need to be flushed the next round anyway.

    The last optimization I mention is if you say have pages n and n+2 dirty while page n+1 is clean in the buffer pool it may be cheaper to write (n,n+1,n+2) in the single write than have two writes plus seek.

    Another interesting question which I should probably check is if there is any reordering happens during double write process, ie 128 pages are written to boulewrite buffer – are they sorted then for writing to original locations or not ? In case of buffering IO OS will reorder them anyway but for O_DIRECT it may matter. Again some room for experiments :)

  4. Bill says:

    Thanks Peter and Hekki,

    I do have a follow up question to a comment Peter had on scaling out.

    PZ: If your load is CPU bound and you’re scaling out you can get better performance by using couple of MySQL Servers on single node. You can make them to use different hard drives and bind to different CPUs.

    What set up are you referring to? multiple master replication? federating tables? How would you recommend making the servers work together?

  5. peter says:

    Bill, This is mainly about scale out, in which case you can have 10 servers (MySQL Instances) or 20 it does not matter, you just chop your data in smaller pieces.

    In such case it may make sense to run couple of MySQL Servers on each physical node instead of only one (each with its own database etc)

  6. Bill,

    Your comment about running multiple MySQL daemons per box is interesting.

    I actually blogged about this a while ago:

    http://feedblog.org/2007/09/23/distributed-computing-fallacy-9/

    The theory being that a quad core machine with 32G of memory and 4 disks should really be considered 4 logical machines and not one.

    Also, I blogged my thoughts about this post here:

    http://feedblog.org/2007/10/29/thoughts-on-innodb-internals-re-heikki-tuur/

  7. Sergei Golubchik says:

    Kevin, you wrote as a comment to Q7:

    “Note that most database systems like MyISAM are very vulnerable to this problem.”

    MyISAM is not. At least not, as long as MySQL is concerned (filesystem caches are vulnerable, of course).
    There are two protections in MyISAM. First – keycache consists of warm and hot parts. Pages are cached in the warm part, and moved to the hot part after receiving a certain number of hits. Each part has its own LRU list. That is a full index scan will only wipe out the warm part of the keycache. Pages that are used most often and that are most important to cache will stay cached in the hot cache. Read more here: http://dev.mysql.com/doc/refman/5.0/en/midpoint-insertion.html

    A second protection: multiple key caches. A table may be assigned to a separate keycache, whatever you’ll do with it it will never affect cached pages of other tables. More here: http://dev.mysql.com/doc/refman/5.0/en/multiple-key-caches.html

  8. Hey Sergei,

    You’re right with regard to the key buffer but I was specifically thinking about the filesystem cache.

    There’s now way for the filesystem cache to know what’s going on in the application layer.

    I actually think there needs to be more coordination here between abstractions.

    For example, if the OS is running out of memory and about to swap it could send a signal to user level applications to tell them to free up their own internal caches to return some memory to the OS.

    Kevin

  9. peter says:

    Kevin, Sergei

    Indeed multiple keycaches and midpoint insertions help for MyISAM. Though I would like to see some experiments to see how well midpoint insertion is able to preserve cache from being washed away by background load.

    The file cache for MyISAM is however serious issue and not only that but Disk Scheduling which in many cases gives way too much preference to large size read/writes which batch jobs may do. At least I have not found IO scheduler yet which would not behave such way.

  10. Kunal Jain says:

    I have a QuadCore Server(4 Cores). How can i configure mysql to use all the four cores. Right now mysql process use only one core whose utilization goes upto 95%.

    Thanks in advance
    Kunal Jain

  11. bnm says:

    ciao
    c hanno parlato d te…allora ki siamo

  12. Jason Plant says:

    christ, your english is horrible.

  13. mustafa vaghjipurwala says:

    how to display the second portion
    in mysql
    plzz help

  14. Rick James says:

    Peter, this Q&A is 5 years old. Some things are stale to the point of being wrong. Could you update it or annotate the improvements?

    * Table scan does not necessarily flush the entire buffer_pool.

    * Various formats (COMPACT/…) have impact on what HT said.

    * “200 is generally good” — not on SSDs?

    * “Huge pages” have not caught on — maybe there is a reason?

    * Current status of “group commit”.

    * XtraDb enhancements / fixes.

    etc.

  15. Rick,

    Yes…. it is old and things are getting outdates as with many articles on this blog. This is limitation of the blog format… It is impossible to update all articles. I am speaking in a lot more details on this topic in my Innodb and XtraDB Architecture and Performance Optimization tutorial which I update with recent MySQL developments.

Speak Your Mind

*