November 23, 2014

Adjusting InnoDB for a Memory-Resident Workload

As larger and larger amounts of memory become common (512GB is something you can fit into a relatively commodity server these days), many customers choose to build their applications so that all or most of their database (frequently InnoDB) fits in memory.

If all tables fit in the InnoDB buffer pool, read performance will be quite good; writes, however, will still suffer because InnoDB does a lot of random IO during fuzzy checkpoint operations, which often becomes the bottleneck. This problem drives some customers who are not concerned with persistence to run InnoDB off a RAM drive.

In fact, with relatively simple changes InnoDB could be made to perform much better for memory-resident workloads, and we should consider fixing these issues in XtraDB.

Preload It is possible to preload all InnoDB tables (the ibdata and .ibd files) at system start – this would avoid the warmup problem and also make crash recovery fast even with very large log files, since random IO is what usually limits recovery speed. Because the files can simply be read sequentially, read speed can reach hundreds of megabytes per second even on commodity RAIDs.
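A minimal sketch of what such a preload amounts to if done by hand today: reading the tablespace files sequentially from the shell warms the OS page cache. This is only an approximation of the proposal – a native implementation would read the pages into the InnoDB buffer pool itself, and the shell trick does not help if you run with O_DIRECT. The datadir path below is an assumption:

    # Sequentially read InnoDB tablespace files to warm the OS page cache.
    # /var/lib/mysql is an assumed datadir - adjust for your installation.
    DATADIR=/var/lib/mysql

    # Shared tablespace (ibdata files)
    cat "$DATADIR"/ibdata* > /dev/null

    # Per-table tablespaces (.ibd files), if innodb_file_per_table is used
    find "$DATADIR" -name '*.ibd' -print0 | xargs -0 cat > /dev/null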

Sequential Checkpointing Currently the fuzzy checkpoint flushes the pages which have not been flushed for the longest time, which causes random IO. In a memory-resident checkpoint mode we should instead flush everything (yes, even clean pages) sequentially. This would give us sequential writes at 100MB/sec or more, which means a 256GB buffer pool could be fully flushed roughly every 30-45 minutes. It should then be possible to simply size the InnoDB logs so that they are not cycled through faster than this flush cycle.
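As a back-of-envelope sketch of that flush cycle, assuming the 256GB buffer pool and 100MB/s sequential write speed used as examples above:

    # Rough flush-cycle arithmetic for sequential checkpointing.
    # Both numbers are just the examples from the text - adjust for your hardware.
    BUFFER_POOL_GB=256
    SEQ_WRITE_MBPS=100   # assumed sequential write speed of the array

    # Time for one full sequential pass over the buffer pool
    echo "$(( BUFFER_POOL_GB * 1024 / SEQ_WRITE_MBPS / 60 )) minutes per flush cycle"
    # -> ~43 minutes at 100MB/s; the combined InnoDB log size needs to be large
    #    enough that the logs are not cycled through faster than this.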

This may not be the most optimal solution if you were designing the system from scratch, but it is something that can be done without significantly changing the InnoDB core logic or touching the on-disk storage format at all.

This is just an idea at this point, which I’ve discussed with some customers – we’re not working on it yet, though if you think it would help with your performance challenges, we would be happy to implement it for you.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Psih says:

    That should definitely be implemented – I’m working on some social sites where the databases aren’t in the terabytes, but have a very high read/write ratio, and therefore very heavy random I/O. Being able to build a 128GB RAM server and effectively fit the database into memory, so that disk I/O won’t limit its performance, would be awesome.

  2. Another approach is to update fuzzy checkpointing with more tunables.

    In reality, it only makes sense to write to disk at about 50MB/s, since you’re also writing to the logs at 50MB/s and you would still have to recover.

    If you were to pause, you could just write dirty blocks sequentially and jump over clean blocks.

    It would be pseudo-random, but it is faster to seek over 100MB than to write it all….

    Kevin

  3. srv says:

    Hello, Peter.

    This is absolutely off-topic, but could you please give me some insight into a situation I just had? Recently we got a lot of threads in the process list on one of our dedicated MySQL servers. I’m not a DBA, but I know a little more about MySQL than most of the people here, so fixing this was up to me. CPU utilization, iowait, context switching (all via vmstat) and disk utilization (via iostat) were OK: ~20% for user+sys, 0-1% iowait, ~5% disk utilization. The data size was smaller than innodb_buffer_pool_size (almost all tables use InnoDB), and the server still had a lot of free memory. So I concluded this was not a resource bottleneck but perhaps some concurrency issue. I tried to play with the innodb_thread_concurrency setting, but that did not help. After that I started to analyze the queries (I’m not good at understanding complicated queries and I do not know the DB structure for this project, so I had hoped to avoid this step).

    I found out that there was a set of queries that hurt performance badly, all of them with “using intersect” in the “Extra” field and “index_merge” in the “type” field of EXPLAIN. Other queries were fine. Further investigation showed that all the tables used by these queries had 1-5 million rows and were poorly indexed, with simple indexes on almost every field, most of them with low cardinality (< 100). I just dropped those indexes and the queries started to execute in less than a second.

    We are using MySQL 5.0.27 with no patches (except maybe OS-vendor-specific ones – it’s RHEL 5.0). The binary log is on (this instance is used as a master for replication).

    What happened? Is this a known behavior of index_merge on this data/index set? Is this a bug? If it is, could you please help me understand what caused it so I can file a good bug report?

  4. peter says:

    Kevin,

    It is a tricky question whether data which does not need to be flushed should be skipped. If these “holes” are large (say several MB in size) it makes sense to skip them. However, if they are just scattered random holes it makes sense to write over them – for example, if you look at a 1MB block (64 pages) which has only 16 random pages dirty, it will be faster to do one 1MB IO than 16 individual IOs with seeks in between. But this is a tiny, easy optimization.

    Now regarding log vs. data – these can hardly be compared. Data is always written in 16K pages, even if only a small part of the page was modified, while the log can be written in 512-byte blocks. In the worst case each single 512-byte log record can result in a dirty 16K page, which makes the data write requirement roughly 32x the log write volume (see the back-of-envelope sketch below). In reality, however, there can be multiple changes to a page before it needs to be flushed, so the difference is really not so bad.

    Note, however, that even if you’re working with large BLOBs or something else that brings log IO close to data IO, the data IO will still be doubled because of the doublewrite buffer.
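    As a back-of-envelope, using the 16K page size and 512-byte log block size mentioned above (plus the 2x from the doublewrite buffer):

        # Worst-case data vs. log write amplification, numbers from the comment above
        PAGE_SIZE=16384   # InnoDB page size in bytes
        LOG_BLOCK=512     # minimum log write unit in bytes
        echo "$(( PAGE_SIZE / LOG_BLOCK ))x data writes vs. log writes, worst case"    # -> 32x
        echo "$(( 2 * PAGE_SIZE / LOG_BLOCK ))x when counting the doublewrite buffer"  # -> 64x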

  5. peter says:

    srv,

    Answering you here would send a signal to everyone that polluting the blog with irrelevant comments is a great way to get a reply, and I do not want to encourage such behavior.
    Feel free to ask your question on our forums instead:
    http://forum.percona.com/

    You’re also surely welcome to hire us to assist you with any difficulties.

  6. One thing to note.

    Our warmup script has a LOT of random IO when warming up indexes.

    I’m doing a warmup right now and it’s probably going to take 30-60 minutes.

    I suspect that this could be sped up 10x if InnoDB had a native warmup method instead of having to ‘trick’ InnoDB into warming up by running a custom query.
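    (For reference, that kind of query-based warmup typically looks something like the following – the database, table, and index names are placeholders. Forcing full index scans makes InnoDB pull the pages into the buffer pool, but they are fetched in key order, which is where the random IO comes from.)

        # Hypothetical query-based warmup; mydb, mytable and idx_user are placeholders.
        mysql -e "SELECT COUNT(*) FROM mydb.mytable FORCE INDEX (PRIMARY)"    # clustered index (data)
        mysql -e "SELECT COUNT(*) FROM mydb.mytable FORCE INDEX (idx_user)"   # a secondary index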

  7. peter says:

    Kevin,

    Right. This is one of the features on the to-do list – some command to “fetch” either the main tablespace or a bunch of .ibd files. It is actually a relatively easy one, but it should help a lot with warmup cases like yours, using just sequential IO.

  8. This feature would definitely be a big compelling reason to run XtraDB :)

    Kevin

  9. Hey Peter and Vadim, this site is absolutely great. I wonder how come it’s the only good one of this kind. Database performance issues are all around, and yet I find almost all my answers here.

    Keep up the good work, guys!

  10. Yogesh says:

    How much memory should I allocate to innodb_buffer_pool for a 100 GB database?
