Heikki Tuuri to answer your in-depth InnoDB questions

Have you ever had a question about InnoDB internal design or behavior that is not well covered in the MySQL manual? I certainly have.
Now you have a great chance to get it answered!

Heikki Tuuri, creator of InnoDB, will answer your questions about InnoDB at MySQL Performance Blog.

Please leave your questions as comments on this post by the 5th of October, and I will pass them on to Heikki, merged with questions of my own.

Note: due to Oracle policies, Heikki will likely be unable to answer questions about new InnoDB features or timetables.


Comments (51)

  • Alexander Rubin

    How InnoDB handles blob/text fields (needs more info)

    According to the documentation, InnoDB puts the first 768 bytes of each text/blob column in the page
    and will allocate some space outside the page for the rest.

    1. If the total size of the row is less than 8100 bytes, InnoDB will not allocate additional
    space for blobs, even if each blob is larger than 768 bytes.
    2. Otherwise, InnoDB will allocate additional space outside the page for _each_ blob (if we have 8
    blobs, 8xN bytes will be allocated).

    How much space does InnoDB allocate for each blob outside the page?

    Test case in: http://bugs.mysql.com/bug.php?id=29042
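The two cases above can be sketched as arithmetic. This is a rough model based only on the numbers quoted from the documentation (768-byte inline prefixes and a roughly 8100-byte row limit on a 16KB page), not on the actual InnoDB source:

```python
BLOB_PREFIX = 768   # bytes of each blob kept inline in the page
ROW_LIMIT = 8100    # approximate maximum row size on a 16KB page

def off_page_bytes(fixed_cols_size, blob_sizes):
    """Bytes InnoDB would store outside the page for a row's blobs."""
    total = fixed_cols_size + sum(blob_sizes)
    if total < ROW_LIMIT:
        return 0  # case 1: the whole row fits in the page
    # case 2: each blob keeps only its 768-byte prefix in the page
    return sum(max(0, b - BLOB_PREFIX) for b in blob_sizes)
```

For example, two 5000-byte blobs push a row over the limit and 2 x (5000 - 768) = 8464 bytes go off-page, while two 1000-byte blobs in the same row stay entirely in the page.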

    September 28, 2007 at 10:33 am
  • mike

    How does MySQL Cluster help with big tables and lots of write activity?
    Especially, how does it help with the foreground log I/O writes? How can
    MySQL Cluster spread out log I/O writes, etc.?

    September 28, 2007 at 11:12 am
  • mike

    Is there any advantage to using prepared statements in Java for DML in the following scenario:
    get a connection, prepare a statement, substitute the variable value, execute, close the statement –
    repeat this process over and over.

    Don’t you need to avoid closing the statement, so that you don’t have to re-prepare it,
    in order to gain any performance advantage?
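For what it's worth, the reuse pattern in question can be sketched in Python, with the standard library's sqlite3 standing in for a MySQL connector (an assumption for illustration; the principle, prepare once and execute many times, is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")

# Prepare once, execute many: keep one cursor open and rebind parameters.
cur = conn.cursor()
for i in range(100):
    cur.execute("INSERT INTO t VALUES (?, ?)", (i, "x"))
cur.close()
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Closing and recreating the statement object inside the loop would discard exactly the parse/plan work a prepared statement exists to amortize, which is the point of the question.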


    September 28, 2007 at 11:38 am
  • peter


    Heikki is an InnoDB expert, not a MySQL Cluster or prepared statement expert, so it is best if you focus your questions on InnoDB 🙂

    September 28, 2007 at 11:41 am
  • mike

    Sorry. I’ll post those elsewhere.

    Innodb Q: Is Group commit still busted in 5.0/5.1?

    September 28, 2007 at 12:04 pm
  • Kevin Burton

    Mike…. RE #3

    You should use a connection pool to avoid opening and closing lots of connections.

    September 28, 2007 at 3:16 pm
  • mike

    Hi Kevin,

    I am using a connection pool. Unless you avoid closing the statement, it’s my belief
    that prepared statements don’t buy you anything. The issue is that opening and closing
    the statement defeats the purpose of a prepared statement.

    Anyway, I moved the question to the MySQL forum here, so we don’t plug up the intended
    use of these comments.

    September 28, 2007 at 3:26 pm
  • Kevin Burton

    INNODB uses its own internal buffer pool, but it’s in conflict with the Linux kernel’s page cache.

    In high load situations where INNODB is using most of the system’s memory the kernel can decide (incorrectly) to swap out MySQL.

    There are two solutions here. Memlock and O_DIRECT.

    Memlock is apparently unstable on Linux.

    O_DIRECT is an alternative but Linus hates it.

    What are your suggestions here?

    September 28, 2007 at 4:10 pm
  • Kevin Burton

    Just an FYI for readers at home: the --memlock command line option calls the mlockall() system call to keep the process in memory.

    O_DIRECT bypasses the kernel page cache to do I/O directly. I’m not sure whether it also helps keep the process locked in memory.

    September 28, 2007 at 4:15 pm
  • Kevin Burton

    We’ve migrated to using INNODB so that it operates 100% out of memory. This way
    every write is serial, so we can see the full write speed of the disk.

    INNODB uses fuzzy checkpointing to commit data to disk. The problem we’ve
    seen in high IO scenarios is that it commits too often, and we’re only seeing
    33% of the raw disk write speed. Since the commit isn’t perfectly serial, it’s
    having to seek on disk, which slows down throughput.

    If we could control the checkpointing rate we could use an 8G write ahead log
    and tell innodb to do a full write of the database (basically dump memory to
    disk) once every minute or two.

    This way we’d see a 3x performance boost, and it would write at 100MBps vs the ~33MBps we see now.

    Any plans to enable tuning of the checkpointing rate? Postgres exposes this
    data and allows the user to tune the checkpointing values.
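The numbers in this comment imply the following back-of-the-envelope arithmetic (the figures are assumptions taken from the comment itself, not measurements):

```python
raw_mbps = 100.0          # assumed raw sequential disk write speed
observed_fraction = 0.33  # "33% of the raw disk write speed"
observed_mbps = raw_mbps * observed_fraction

speedup = raw_mbps / observed_mbps        # the claimed ~3x boost
log_gb = 8                                # proposed write-ahead log size
flush_seconds = log_gb * 1024 / raw_mbps  # time to dump 8GB serially
```

At full serial speed, dumping the proposed 8GB log takes under a minute and a half, which is why a checkpoint interval of "once every minute or two" lines up with the request.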

    September 28, 2007 at 4:23 pm
  • Kevin Burton

    What’s the status of INNODB in 5.1.x?


    What’s the current status of innodb_thread_concurrency? Is it suggested to set
    this value to a LARGE value (something like 200) for additional throughput?

    Brian Aker did some interesting work on this:


    And it looks like 5.1.22 has the necessary patches.

    Also, what’s the deal with group commit being broken in 5.1.x? Is it possible
    to get the same benefit if you’re performing INSERT with multiple values? We
    build up INSERTs on our clients and insert 50 or so rows at a time.

    Rumor has it that if you disable the new XA support (which we don’t use) then
    you’ll get group commit again.

    September 28, 2007 at 4:44 pm
  • Kevin Burton

    INNODB has typically had problems scaling on multicore boxes. The new quad-core CPUs from AMD and Intel mean that 8-core boxes are going to be commonplace.

    We’re considering buying 40 8-core boxes with 32G of memory.

    Have any specific thoughts here? INNODB was originally written on single core CPUs.

    The other day I blogged about how this could be considered another distributed computing fallacy:


    September 28, 2007 at 4:46 pm
  • Roland Bouman

    In InnoDB, the referenced columns of a unique constraint need not form a primary key or a unique constraint: it is sufficient if the referenced columns form a consecutive set of leftmost columns of any index.

    Can you describe a real-world use case where it would be useful to have a foreign key not reference an entire primary key or unique constraint (something that is mandatory in every other RDBMS that supports foreign key constraints)? Any concrete examples are appreciated.

    Thanks in advance,


    September 28, 2007 at 5:10 pm
  • Roland Bouman

    gah…I meant: “the columns referenced by a *foreign key constraint* need not form a primary key or a unique constraint”. Hope the question is clear now.

    September 28, 2007 at 5:13 pm
  • mfc

    If my workload has a high volume of concurrent writes to about 6 tables,
    what version of InnoDB/MySQL would you recommend? I’ve been staying on
    4.1.14 because of all the bad things I’ve heard about loss of performance
    with 5.0.X. I’m hopeful that 5.1.X will be better, but still not sure.

    Follow up question:

    My impression (as an external user of InnoDB) is that performance bugs
    don’t receive very high priority. Yet a lot of us are using MySQL because it seems to offer
    the necessary performance. Are we just kidding ourselves, and should we switch to Oracle?

    September 28, 2007 at 7:23 pm
  • mike

    Would doubling the amount of RAM from 8GB to 16GB help the write performance of InnoDB
    on Solaris?


    September 28, 2007 at 7:43 pm
  • mike

    Question about using solid state disks for storing ib_log* files

    I know that innodb writes files in the log file group in
    a round robin fashion. I also know that there is some writing that always occurs
    to the first file in a log file group. My questions are:

    1) What is always being written to the first file in the log file group?
    2) How often or what is the pattern of these first log file writes?
    3) Would they always occur at the same disk location(s)?
    4) Do they occur for every log entry?

    I’m just trying to see if there will be some special interaction between these
    always occurring writes to the first file in the log file group and the
    SSD dynamic wear leveling algorithm. I’m also wondering if you would recommend
    using SSD for this purpose, and how much performance gain there might be.

    September 29, 2007 at 9:48 am
  • Sigurd Urdahl

    Hi Heikki,

    I believe this is a somewhat trivial question, but it’s hard to find a good answer to it while googling.

    “What is the good way to split memory between INNODB buffers and OS?”

    Of course this depends on all kinds of things, but the only rule of thumb I seem to find is that InnoDB should get around 70-80% of the available RAM. I would guess that the number varies, depending on the amount of RAM available: on a dedicated MySQL server you will still need some memory for the OS and the various other things that will be running (monitoring, backup, cron, syslog, etc.), but the share of RAM used by these non-MySQL processes drops rapidly as you increase the amount of system memory.

    We just deployed a new Linux/MySQL server for a client, upgrading their main database from 16 GB and two two-year-old hyperthreaded Xeons to 32 GB and 4x quad-core AMD Opterons (yes, it helped :-)). At the moment the server has around 21 GB of memory used for applications (of which MySQL is the main one, with “innodb_buffer_pool_size=20480M”) while the bulk of the rest (9 GB) is used for the OS cache (yes, I know, we are below 70-80% already).

    So, is the general recommendation also for boxes with larger amounts of RAM to give InnoDB ~70-80% of the total RAM and leave the rest to the OS cache? Are there special options or tricks that should be used on larger servers?
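The two sizing heuristics discussed in this thread can be sketched as follows. Both are rules of thumb only; which one is right is exactly the question being asked here:

```python
def pool_size_pct(total_ram_gb, pct=0.75):
    """The classic '70-80% of RAM' rule, using 75% as a midpoint."""
    return total_ram_gb * pct

def pool_size_headroom(total_ram_gb, headroom_gb=4):
    """Fixed headroom: leave a set amount for the OS and other processes,
    and give InnoDB the rest. Scales better on large-memory boxes."""
    return max(0, total_ram_gb - headroom_gb)
```

On the 32 GB server described above, the two rules give 24 GB and 28 GB respectively, which brackets the 20 GB actually configured.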

    kind regards,

    October 1, 2007 at 8:21 am
  • Kevin Burton


    Maybe this will make Heikki’s job easier 🙂

    Don’t give anything to the OS cache. It’s redundant with the innodb buffer pool.

    Instead just give as MUCH memory as possible to INNODB. 100% if possible.


    October 1, 2007 at 11:41 pm
  • Kevin Burton

    Here’s another question.

    Wouldn’t it be a good feature to be able to open the write-ahead logs with O_DIRECT? I have a feeling it’s putting pressure on the VM by having pages enter the cache, thus pushing some of the buffer pool into the page file.

    Opening the log files with O_DIRECT would fix this problem.


    October 1, 2007 at 11:42 pm
  • Sigurd Urdahl

    (in response to comment #19)


    Thanks for the answer, but it was maybe a bit more to the point than I was hoping for :-)

    Are you saying that the 70-80%-rule is wrong, or that the general rule should be “keep enough free memory for anything else to run comfortably, and then give the rest to INNODB”?

    kind regards,

    October 2, 2007 at 1:56 am
  • Peter Bach

    There have been several MySQL bugs opened about multi-core scalability (concurrent queries, autoincrement, concurrent inserts, instrumentation, etc.). Rather than treat each defect as an individual case, is there any performance testing going on to quantify what are the limits of different operations for different levels of concurrency, specifically around mutexes and contention? How does this fit into the development process?

    October 2, 2007 at 8:49 am
  • Peter Bach

    We have a fairly large InnoDB table (150GB) that is showing poor performance for full table scans. Using O_DIRECT, we are seeing individual 16KB page I/Os going out one at a time that are essentially read misses (10ms each). The reads are not exactly sequential; there is a small gap between each.

    It is difficult to figure out what the problem is. We can OPTIMIZE the table, but how long will this take? Is there a facility we can run to show any table/index fragmentation, without having to run OPTIMIZE?

    October 2, 2007 at 8:54 am
  • Kevin Burton


    Regarding #23… if you’re using code to do a full table scan try to scan it by using ORDER BY based on the primary key.

    Since INNODB is clustered you’ll be able to do sequential reads across the WHOLE table.

    Of course at 150G and 100MBps (assuming this is the serial IO rate of your disk subsystem) it will still take 25 minutes to process the whole table.

    Note that this is the theoretical maximum performance you’ll ever be able to achieve.
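Checking the arithmetic: 150GB read sequentially at the assumed 100MB/s serial rate, versus one 16KB page at a time at roughly 10ms per read, as described in the original question:

```python
# Sequential scan of the whole table at the assumed raw disk rate.
table_gb = 150
seq_mbps = 100
seq_minutes = table_gb * 1024 / seq_mbps / 60   # ~25.6 minutes

# Versus the one-page-at-a-time random reads from the original question.
page_kb = 16
pages = table_gb * 1024 * 1024 / page_kb        # 16KB pages in 150GB
random_hours = pages * 0.010 / 3600             # at 10ms per page read
```

So a clustered sequential scan is on the order of 25 minutes, while page-at-a-time read misses would take over a day.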


    October 2, 2007 at 2:07 pm
  • Kevin Burton


    Regarding #22… you’re right. Multi-core scalability is a BIG problem that I’d like to see addressed with INNODB.

    BTW… I think MySQL should assume machines will have 8 cores from here on out. They’re REALLY cheap. You can get quad-cores from AMD and Intel now and put two chips per box.

    An 8-core machine with 16G of memory is only about $4k now.

    October 2, 2007 at 2:09 pm
  • Kevin Burton


    Regarding #21

    Yes:

    “the general rule should be ‘keep enough free memory for anything else to run comfortably, and then give the rest to INNODB’”

    Ideally you wouldn’t run anything else, but if you do, just allocate enough memory for that and then give the rest to INNODB.


    October 2, 2007 at 2:10 pm
  • Kevin Burton

    Another question.

    Why can’t I open my log files with O_DIRECT? I think I’m seeing cache pressure on the Linux VM, which is causing the buffer pool to dive into the page file.

    Memlock fixes it, of course, but I think there’s no reason to cache the log file.


    October 2, 2007 at 2:11 pm
  • Alexander Rubin


    You can tune swappiness on the OS level.
    Depending on the VM configuration Linux will either shrink the buffer cache or swap programs out.

    Using “echo 0 > /proc/sys/vm/swappiness” the VM strategy is being set to “always shrink the buffer cache”. To make the setting persistent, add

    vm.swappiness = 0

    to /etc/sysctl.conf.

    The value is a percentage. By setting it to 100, Linux will always swap programs out and never shrink the buffer cache.

    October 2, 2007 at 5:19 pm
  • Kevin Burton

    Hey Alexander.

    I’m aware of the swappiness setting.

    In fact my blog is #3 in google for the query “swappiness”

    … that said. It doesn’t work.

    I blogged about it here:



    October 2, 2007 at 5:35 pm
  • mike


    With a 1.5 TB database on 4.1.14 consisting mostly of InnoDB tables, I’m looking at upgrading to 5.1.X
    as soon as it becomes generally available. I can’t afford much downtime, and I know there is reformatting
    involved in going to the next release.

    The manual recommends going from 4.1 to 5.0 and then 5.0 to 5.1, but that sounds like several reformatting steps.
    With a large database, that means lots of downtime.

    How can I avoid as much reformatting as possible and still move to 5.1.X? Can you give me the name of
    someone you consider an expert on migration so I can discuss this? My email is mikefconnell@verizon.net


    October 2, 2007 at 10:57 pm
  • Pavel Francirek

    Why does InnoDB start to lose space after some number of queries (usually every few weeks at thousands of q/s), such that only a restart of the database helps?

    October 3, 2007 at 12:55 am
  • Morgan Tocker

    Pavel: I would assume it’s your purge thread lagging in InnoDB if restarting improves performance. Check your processlist and see if you have a long running query.

    October 3, 2007 at 2:15 am
  • Alexander Rubin

    If you do a dump and restore, then you do not need to go 4.1->5.0->5.1.
    However, a dump/restore of 1.5TB is very slow. You can try running the dump in parallel for many tables.
    For migrating from 4.1 to 5.1 you can also try starting MySQL 5.1/InnoDB on the 4.1 tablespace.

    I’ve done a migration of 1.3TB from 4.0 to 5.0, and we had to do a dump/restore (too many changes from 4.0 to 4.1). There are actually fewer changes from 4.1 to 5.0.

    To minimize downtime you can use a slave server: set up a slave and make sure it is up to date with the master, perform the migration on the slave, wait until the slave catches up again, test the slave, and then move over to it (so the slave becomes your new main/master box).

    October 3, 2007 at 3:53 am
  • Alexander Rubin

    Question to Heikki

    It seems that adaptive hash indexes are either not used for primary key lookups or not shown in “show innodb status”:
    select * from table where id=

    show innodb status:

    Hash table size 10624987, used cells 1, node heap has 1 buffer(s)
    0.00 hash searches/s, 15979.01 non-hash searches/s

    October 3, 2007 at 4:04 am
  • mike

    Re: 33

    To sharpen the migration question more: can a 4.1.14 database be used without problems
    against a 5.1.X code base? I really need to avoid as much downtime as possible, even
    if it is done with parallel dumps and imports. Several tables have 500M+ rows.



    October 3, 2007 at 12:25 pm
  • Harrison


    Re: 27

    The reason you can’t use log files with O_DIRECT is how the log file is written. O_DIRECT has a lot of restrictions on alignment, size of requests, etc. The InnoDB tablespace uses 16k pages, which happen to match up perfectly with all of those restrictions. The REDO log is not aligned or padded in any way to allow this.

    In theory, you could rework how the log file is written and make it obey all of those rules, but it is certainly much more complex than just setting an O_DIRECT flag makes it sound.

    – Harrison
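The alignment restriction Harrison describes can be illustrated with a toy check. 512-byte alignment is assumed for the example; the actual O_DIRECT requirement depends on the kernel, filesystem, and device:

```python
ALIGN = 512

def direct_io_allowed(offset, length):
    """O_DIRECT requires the file offset and request length to be
    multiples of the alignment (the in-memory buffer, too, in real life)."""
    return offset % ALIGN == 0 and length % ALIGN == 0

page_write_ok = direct_io_allowed(16384 * 3, 16384)  # a 16KB tablespace page
log_write_ok = direct_io_allowed(1000, 137)          # an unpadded log record
```

A 16KB page write always satisfies the rule; an arbitrary-length, unpadded redo record generally does not, which is why the redo log would need the rework Harrison describes.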

    October 3, 2007 at 1:57 pm
  • Kevin Burton

    Hey Harrison.

    Fair enough. If this is a VM bug, then it would be better to either fix it, disable OS-level paging, or restrict the cache to less than 100% of memory.

    The problem I was having is that Linux decides to page even though the whole DB fits in memory and there’s plenty of it available.

    RHEL actually has patches which allow you to allocate less than 100% of system memory to cache which was used by tuners running Oracle on RHEL.

    For now though memlock actually works fine.


    October 3, 2007 at 2:19 pm
  • Alexander Rubin

    Re: 35


    If you do a dump/restore, there will be no problems with binary incompatibilities (as you will recreate the tables), except for reserved words.

    New reserved words were introduced in 5.0 and 5.1:

    If your table names/column names in version 4.1 use reserved words, that can be a problem.
    To test, you can dump the schema only from version 4.1 and restore it on an additional 5.1 server.

    Do you want to migrate to MySQL 5.1 (it is not released yet) because of the partitioning?

    Again, using a slave server will help decrease downtime, as you will be able to write to the master even during the migration process. A 1.5TB dump/restore will take a long time, maybe a couple of weeks.

    Hope this helps.

    October 3, 2007 at 7:16 pm
  • mike

    Re: 38

    Hi Alexander,

    I can afford maybe 24 hours max. That’s why I don’t want to

    Reasons to migrate are:

    1) Be on latest version
    2) Hopefully 5.1.X has better performance due to:
    a) O_DIRECT works in 5.1 on Solaris
    b) 5.1 eliminates concurrency problem with auto-inc
    c) Partitioning
    3) My boss wants to

    So not real strong reasons. I can stay on 4.1.14 if I
    have to.

    The master/slave idea sounds interesting, can you send me
    email so we can discuss offline?

    October 3, 2007 at 9:34 pm
  • peter

    OK, I combined the questions, added some 10 more of my own, and sent them to Heikki to answer.

    October 6, 2007 at 12:52 pm
  • mike

    I know that SELECT COUNT(*) or COUNT(column) on InnoDB is not optimized: because of how InnoDB does row versioning, it does not keep track of the row count. So people suggest using a separate table with table_name/num_rows. However, couldn’t this easily be fixed in MySQL? If the table handler is InnoDB, a SELECT COUNT(*) (with no WHERE clause, obviously) could read the row count from INFORMATION_SCHEMA, or maybe just another internal variable. I was just thinking about this earlier today and thought I would pose it as a question.
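The counter-table workaround mentioned above can be sketched with triggers. sqlite3 stands in for MySQL here as an assumption for illustration, but the same idea works with triggers in MySQL 5.0+:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY);
-- counter table: one row per tracked table
CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, num_rows INTEGER);
INSERT INTO row_counts VALUES ('t', 0);
-- keep the counter in sync on every insert/delete
CREATE TRIGGER t_ins AFTER INSERT ON t BEGIN
  UPDATE row_counts SET num_rows = num_rows + 1 WHERE table_name = 't';
END;
CREATE TRIGGER t_del AFTER DELETE ON t BEGIN
  UPDATE row_counts SET num_rows = num_rows - 1 WHERE table_name = 't';
END;
""")

for i in range(5):
    conn.execute("INSERT INTO t (id) VALUES (?)", (i,))
conn.execute("DELETE FROM t WHERE id = 0")

# Read the maintained count instead of running COUNT(*) over the table.
count = conn.execute(
    "SELECT num_rows FROM row_counts WHERE table_name = 't'").fetchone()[0]
```

This pushes the application logic mike mentions down into the database, though as the follow-up notes, the counter only reflects committed single-statement changes and is not transaction-aware the way InnoDB's own COUNT(*) is.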

    October 8, 2007 at 5:18 pm
  • Alexander Rubin

    re: 41

    Mike, the INFORMATION_SCHEMA / show table status info is not 100% correct; it is approximate. COUNT(*) will give you the exact number of rows in InnoDB (with consideration of other transactions / repeatable read).

    October 8, 2007 at 6:57 pm
  • mike

    re: 42

    There could still be some other way to intercept that and provide a reliable method without making separate tables and having to add application logic to update them, right?

    October 8, 2007 at 7:06 pm
  • Alexander Rubin

    Re: 42

    That is a good question for Heikki, actually.

    October 10, 2007 at 9:47 am
  • Khaled

    Dear Sir,

    Could you please help me find pseudocode for the two-phase locking algorithm or the intention lock algorithm?
    Thank you

    January 20, 2008 at 5:23 am
  • Kunal Jain

    I have a quad-core server (4 cores). How can I configure MySQL to use all four cores? Right now the mysql process uses only one core, whose utilization goes up to 95%.

    Thanks in advance
    Kunal Jain

    November 29, 2008 at 2:42 am