Its almost a month since I promised Heikki Tuuri to answer Innodb Questions. Heikki is a busy man so I got answers to only some of the questions but as people still poking me about this I decided to publish the answers I have so far. Plus we may get some interesting follow up questions out of this.
I had added my comments to some of the questions. HT will stand for Heikki Tuuri in the answers and PZ for myself.
Q1: Why did you decide to go with gzip compression instead of implementing “classical” approach of index prefix compression ?
HT: The reason is that index prefix compression prevents one from using a binary search on a B-tree node. We could not compare a search key K to an index record R if we only knew the suffix of R. There are, of course, hybrid approaches where some prefix information is stored, and a search almost as efficient as a binary search can be performed.
But InnoDB’s adaptive hash indexes require that the prefix is stored in every record R. An adaptive hash index tries to guess where our search would end in the B-tree. To verify that we arrived at the right position, we need to know the whole record R.
Also, a gzip compression potentially compresses the database more than just prefix compression. The downside in gzip is increased usage of CPU time.
PZ: We will see how it plays when we can see more benchmarks. The big problems I see so far with this feature are ease of use and requirement to “guess” how your data will compress for the whole table. Among other things compression for data pages and uncompressed index pages can be significantly different.
Q2: Does Innodb has pointers to the next/previous page in index tree leaf pages ?
PZ: It is great as it can help to make full table scans and index scans more efficient. I can’t wait for ability to create physically sorted indexes with Innodb (with index built by sort) this is when this should be helpful the most.
Q3: Does Innodb secondary Indexes trees are allocated in two extents same as primary key tree or stored in the same extent ?
HT: An InnoDB extent is a contiguous block of 64 pages, 16 kB each. Each B-tree index in InnoDB occupies two SEGMENTs. One segment is for the non-leaf nodes, and the other for leaf nodes. The first 32 pages of a segment are allocated from ‘fragment extents’ where any individual page can be allocated to whatever use in whatever segment. After that, a segment always reserves whole 64 page extents.
Thus, the answer to the question is: for small tables, everything can reside within a single extent. For big tables, a secondary index reserves different extents from the clustered index.
Q4: Does Innodb ever merges sequential pages together if they become almost empty
HT: Yes. If a page becomes less than half full, a merge is attempted.
/* In the pessimistic delete, if the page data size drops below this
limit, merging it to a neighbor is tried */
#define BTR_CUR_PAGE_COMPRESS_LIMIT (UNIV_PAGE_SIZE / 2)
Q5: When Innodb free space becomes available for use within same object? When does Innodb make it available for other objects.
HT: Free space can always be recycled within a segment. A segment can release free space to other segments, if a 64 page extent becomes totally empty, and also it can release individual pages from its 32 ‘fragment extent’ pages.
PZ: This is the reason why when you delete a lot of scattered rows you may not see a “free space” in Innodb tablespace to grow significantly.
Q6: Does Innodb policy replacement algorithm takes into account page position in btree index ?
HT: No. It is a pure LRU algorithm.
PZ: This is one of the things which we would like to play with – I would expect serious improvements are possible especially on high IO pressure workloads.
Q7: Does Innodb has any protection from pages being overwritten in buffer pool by large full table scan
PZ: Another possible area of optimization. I frequently see batch jobs killing server performance overtaking buffer pool. Though full table scan is only one of replacement policy optimizations possible.
Q8: How do you assess current state of Innodb scalability with multiple threads and multiple concurrent transactions ?
HT: The scalability in my opinion is ‘mediocre’, and we are working on improving it.
PZ: Thank you for honest acknowledgment and I can’t wait to get hold of further improved versions.
Q9: Do you favour database “self tuning” or prefer to expose tuning options to be available to the user.
HT: I favor self tuning.
PZ: I also like self tuning in theory. But I really like to have tools to steer the database behavior in the cases when self tuning gets it wrong.
Q10: When Innodb decides to schedule sequential read-ahead, random read ahead ?
HT: InnoDB schedules a sequential read-ahead when it notices an ascending access pattern that passes an extent boundary. Roughly, if most of the pages in the extent have been accessed, and in an ascending order of the
file address, then InnoDB schedules the read of all the 64 pages of the extent where the next index page in the ascending order is. InnoDB has a similar heuristic for descending index scans.
The InnoDB sequential read-ahead is not optimal at all. It should schedule the read of the next extent when we are in the middle of the processing of the previous extent.
Random read-ahead means that when InnoDB notices that if at least 13 pages in an extent have been accessed very recently, then InnoDB reads in all the rest of the pages in the extent.
PZ: Interesting enough, there was some work done by DIKU students improving algorithms used for read ahead and results were significant. They however were not merged in Innodb mainline yet.
Q11: If read-ahead is happening and Innodb needs to access one of the pages being fetched does it has to wait for whole read ahead request to complete
HT: Unfortunately, it often has to wait. A read-ahead of a 64 page extent is
often done as a single 1 MB file read.
PZ: “often” a bit stikes me here. I thought it would be always single read, typically 1MB in size. I should also mention a post on this topic once.
Q12: If read-ahead request was placed to the queue but did not start executing yet will innodb has to wait for it to complete or can just go ahead and read page it needs
HT: InnoDB has to wait.
PZ: This one is a bummer. I thought Innodb will make sure to schedule synchronous wait in front if read-ahead IO has not been started already. The interesting question it brings – how many read-ahead requests can be queued at once.
Q13: Is there any IO concurrency control – how many random and sequential read aheads can be happening at the same time
HT: All read-aheads are performed in the Unix version with a single thread. Only one read-ahead can be happening on Unix at a time. On Windows, InnoDB uses native async I/O, and can perform many read-aheads at the same time.
PZ: Another thing I’d like to experiment with. I would expect you can gain significantly with multiple IO threads on systems with large amount of hard drives. Google has Patches which let you to try it out.
Q14: Are there any plans to fix Innodb so it unlocks (or never locks) the rows if they were not matched by query where clause when it was not resolved by index
HT: That is already implemented in 5.1: if you set the transaction isolation level READ COMMITTED, then InnoDB normally does not lock the ‘gaps’, and it does not lock rows whose latest committed version does not match the
WHERE condition in a search. Please use this with care: you MUST use row-based binlogging and replication, and remember that the execution is not serializable.
PZ: Cool. This is one of little known new features in MySQL 5.1
Q15: How frequently does Innodb fuzzy checkpointing is activated
HT: InnoDB flushes about 128 dirty pages per flush. That means that under a heavy write load, a new flush and a checkpoint happens more than once per second.
PZ: Not what this answers question exactly. So is it activated as soon as 128 pages must be flushed ? If so how this is discovered when. I think fuzzy checkpointing one of little known aspects of Innodb operations while I see it causing problems every so often.
Q16: How Innodb decided how many pages to flush at each checkpoint interval
HT: Usually it is 128, or less.
PZ: Looks like another magic number to experiment with.
Q17: How InnoBD handles blobs/text fields (needs more info)
According to the documentation, InnoDB put first 768 chars of each text/blob in the page and will allocate some space outside of the page.
1. if the total size of the row is less than 8100 bytes InnoDB will not allocate additional space for blobs, even if each blob is larger than 768.
2. InnoDB will allocate additional space outside of the page for _each_ blob (if we have 8 blobs 8xN bytes will be allocated)
How much space InnoDB allocates for each blob outside of the page?
HT: For each column that InnoDB needs to store ‘externally’, it allocates
whole 16 kB pages. That will cause a lot of waste of space if the fields
are less that 16 kB.
The ‘zip’ source code tree by Marko has removed most of the 768 byte
local storage in the record. In that source code tree, InnoDB only needs
to store locally a prefix of an indexed column.
PZ: I think it is also very interesting question what happens for blobs larger than 16K – is exact size allocated or also segment based allocation is used.
Q18: Innodb Q: Is Group commit still busted in 5.0/5.1?
HT: Yes. However, work has been done on it by Inaam.
Q19: INNODB uses itâ€™s own internal buffer pool but itâ€™s in conflict with the Linux kernelâ€™s buffer pool. In high load situations where INNODB is using most of the systemâ€™s memory the kernel can decide (incorrectly) to swap out MySQL. There are two solutions here. Memlock and O_DIRECT.
Memlock is apparently unstable on Linux. O_DIRECT is an alternative but Linus hates it. What are your suggestions here?
HT: According to tests by Peter Zaitsev, O_DIRECT works well to remove double buffering. In distant future, I would like to have the memlock also.
PZ: memlock already works – it is global MySQL Server option, though it may have problems on Linux if you lock very large portion of physical memory. There is also third alternative – “huge pages” which can be used for buffer pool and which are not swappable. They also allow to reduce number of TLB cache misses. huge pages along however do not stop IO pressure and OS may swap out not buffer pool but some other important parts of MySQL process.
Q20: Weâ€™ve migrated to using INNODB so that it operates 100% out of memory. This way every write is serial so that we can see the full write speed of the disk for throughput.
INNODB uses fuzzy checkpointing to commit data to disk. The problem that weâ€™ve seen in high IO scenearios is that it commits too often and weâ€™re only seeing 33% of the raw disk write speed. Since the commit isnâ€™t perfectly serial itâ€™s having to seek on disk which slows down throughput.
If we could control the checkpointing rate we could use an 8G write ahead log and tell innodb to do a full write of the database (basically dump memory to disk) once every minute or two.
This way weâ€™d see a 3x performance boost and it would write at 100MBps vs 33MBps.
Any plans to enable tuning of the checkpointing rate? Postgres exposes this data and allows the user to tune the checkpointing values.
HT: Hmm… we could tune the way InnoDB does the buffer pool flush. I think Yasufumi Kinoshita talked at Users’ Conference 2007 about his patch that makes InnoDB’s flushes smoother and increase performance substantially.
I assume there is lots of room to tune the flushes, since I never optimized the algorithm under a realistic workload.
Making the doublewrite buffer bigger than 128 pages would require a bit more work. Now it is allocated permanently in the system tablespace when an InnoDB instance is created.
PZ: Sequential “dump” of buffer pool would need more optimizations to work – Innodb would also need also to write not-modified pages if this will get sequential write. Say if we have 1MB segment with all pages in buffer pool it well may be faster to do 1 write than flush 10 non sequential pages from it which were modified.
Q21: Whatâ€™s the status of INNODB in 5.1.x?
Specifically: Whatâ€™s the current status of innodb_thread_concurrency ? Is it suggested to set this value to a LARGE value (somewhat like 200) for additional throughput?
HT: ha_innodb.cc in 5.1:
static MYSQL_SYSVAR_ULONG(thread_concurrency, srv_thread_concurrency, PLUGIN_VAR_RQCMDARG,
"Helps in performance tuning in heavily concurrent environments. Sets the max\
imum number of threads allowed inside InnoDB. Value 0 will disable the thread t\
hrottling.", NULL, NULL, 8, 0, 1000, 0);
The default is 8, which will work well in most cases. If you are not seeing ‘thread thrashing’ (lots of threads waiting for semaphores in SHOW INNODB STATUS), you can try to disable thread throttling completely
by setting the value 0.
But if you see thread thrashing, then a value 1 or 2 often solves your thrashing problem.
PZ: Values 1,2 may solve your thrashing problem but also will limit innodb to be able to use only couple of CPUs efficiently. Thought even that is often better than bad thrashing.
Q22: Also, Whatâ€™s the deal with group commit being broken in 5.1.x? Is it possible to get the same benefit if youâ€™re performing INSERT with multiple values? We build up INSERTS on our clients and insert values 50 or so at a time.
HT: For most users, group commit being broken in 5.1 makes no difference. If you have a battery-backed disk controller cache, then the commit returns in less than 100 microseconds. And if you do not have, then the only way to get good performance is to set innodb_flush_log_at_trx_commit=2, that is, to flush the log to the disk only once per second.
PZ: We’ve helped number of users who had serve regressions in 5.0 because of group commit being broken. Most do not have RAID with BBU. With good RAID with BBU you can get 2000 fsync/sec or more which is typically enough.
Q23: Rumor has it that if you disable the new XA support (which we donâ€™t use) then youâ€™ll get group commit again.
HT: I do not think so. Sergei Golubchik serialized the MySQL binlog write and the InnoDB log flush with a mutex in 5.1, to implement XA, and switching off the XA will not remove that serialization. MySQL’s binlog and InnoDB’s log must have the transactions in the same order, for a recovery based on MySQL’s binlog to work.
Of course, if you remove that mutex from 5.1, then InnoDB’s group commit works again.
PZ: You can disable binary logging to get Group Commit back. Though this is often not the option. Here are some benchmarks we did.
Q24: INNODB has typically had problems scaling on multicore boxes. The new quad core CPUs from AMD and Intel means that 8-core boxes are going to be common place. Weâ€™re considering buying 40 8-core boxes with 32G of memory. Have any specific thoughts here? INNODB was originally written on single core CPUs.
HT: Users have reported lots of InnoDB scalability problems from multicore CPU’s. The most recent 5.0.xx behave better. We will continue the scaling improvements. Latest patches of Yasufumi Kinoshita attained very nice scaling up to 8 or 16 cores. These patches are not in the official InnoDB, though.
PZ: If your load is CPU bound and you’re scaling out you can get better performance by using couple of MySQL Servers on single node. You can make them to use different hard drives and bind to different CPUs.
Q25: In InnoDB, the referenced columns of a foreign key constraint need not form a primary key or a unique constraint: it is sufficient if the referenced columns form a consecutive set of leftmost columns of any index. Can you describe a real-world use case where it would be useful to have a foreign key not reference an entire primary key or unique constraint (something that is mandatory in every other RDBMS that suppors foreign key constraints)? Any concrete examples are appreciated.
HT: Yes, it is sufficient that a foreign key and the referenced key appear as the FIRST (leftmost) columns in an index, in the same order in the foreign key and the referenced key (this same order condition could be relaxed, though). The index is required for fast lookups: it can be used if the columns are the first columns in an index.
I think it is sound design practice to make the foreign key to reference an entire primary key, so that you can easily port your applications to other database brands. Maybe in some denormalized database (for example,
if ORDER and ORDERLINE tables are joined to form a single table), it might make sense for a foreign key to reference only a prefix of the primary key.
PZ: Thanks Heikki, Remember we’re waiting you to answer second portion of the questions.
UPDATE: Second Part of the answers is now available