Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Heikki Tuuri answers to Innodb questions, Part II

November 4, 2007

Author

Peter Zaitsev

Insight for Developers

Share this Post:

I now got answers to the second portions of the questions you asked Heikki. If you have not seen the first part it can be found here.

Same as during last time I will provide my comments for some of the answers under PZ and will use HT for original Heikkis answer.

Q26: You also say on Unix/Linux only one read-ahead can happen at the same time. How many read-aheads can be waiting in Queue when or Innodb will schedule more read-aheads only when given read-ahead is completed?

HT: A query thread normally posts about 64 page read requests per a readahead. The InnoDB buffer is 256 page read requests for the aio read thread. Thus, about 4 readaheads can be in the queue at a time.

The read aio thread will normally read all the 64 page read requests in a readahead, and then move to serve the next readahead.

PZ: Would be interesting to play with read-ahead depth here. These settings mean there can be quite significant stall waiting for the page which is being requested via read-ahead.

Q27 If my workload has a hi amount of concurrent writes to about 6 tables, what version of Innodb/MySQL would you recommend? Iâ€™ve been staying at 4.1.14 because of all the bad things Iâ€™ve heard about loss of performance with 5.0.X. Iâ€™m hopeful that 5.1.X will be better, but still not sure.

HT: 5.0.xx, where xx >= 30 should be better than 4.1.xx.

PZ: I should note MySQL 5.0 vs 4.1 has a lot of performance differences and not all of them are Innodb specific and it well can affect your workload. So benchmark to be sure. In Innodb besides scalability changes in later 5.0 series compact row format was implemented which can help performance (keeping data smaller) or can a bit slow it down. Plus there were bunch of micro optimizations in Innodb in 5.0 series. Also note the “Auto Increment Issue” is finally solved in 5.1 which may be even better.

Q28: My impression (from an external user of Innodb point of view) is that performance bugs donâ€™t receive very hi priority. Yet, a lot of us are using MySQL because it seems to offer the neccessary performance. Are we just kidding ourselves, and should we switch to Oracle?

HT: You are right, bugs that cause data corruption, crashes, or wrong query results get the highest priority. We are slowly working on the performance enhancement requests. It takes time for the patches to mature. We improved the scalability of InnoDB-5.0.30 in December 2006 with a large patch, and the ‘last’ serious bug in it was only recently found and fixed: http://bugs.mysql.com/bug.php?id=29560

Concerning a switch to Oracle’s main database, I believe different applications work best on different database brands. Oracle’s main database has a very competent team working on it.

Q29: Would doubling the amount of RAM from 8Gig to 16Gig help the write performance of Innodb on Solaris?

HT: Yes, if your ‘write working set’ fits in 16 GB. I mean, if the database buffer pool can buffer many writes to the same page, then fewer disk writes are needed. But if your write working set is, say, 100 GB, then increasing the buffer pool size does not help.

PZ: I should note for most application working set is a distribution rather than fixed number so you can say “fits” or “does not fit”. Depending on data access distribution gains will be different ranging from no gains to very significant ones. Also do not forget you may need to increase size of your Innodb log files to be able to use a lot of memory for write buffering efficiently.

Q30: Question about using solid state disks for storing ib_log* files; I know that innodb writes files in the log file group in a round robin fashion. I also know that there is some writing that always occurs to the first file in a log file group. My questions are: What is always being written to the first file in the log file group?

HT: InnoDB writes the checkpoint information to the first file.

Q31: How often or what is the pattern of these first log file writes?

HT: At every checkpoint. It may happen even more often than once per second if there are dirty pages in the buffer pool.

Q32: Would they always occur to the same memory location(s)?

HT: Yes.

Q33: Do they occur for every log entry?

HT: No.

Q34 Iâ€™m just trying to see if there will be some special interaction between these always occurring writes to the first file in the log file group and the SSD dynamic wear leveling algorithm. Iâ€™m also wondering if you would recommend using SSD for this purpose, and how much performance gain there might be.

HT: I guess the log writes themselves are much more a problem to the SDD wear leveling: if you commit 1000 transactions per second, then you write 1000 times per second!

A smart wear leveling should be able to cope with writes to the same file position. That kind of access is so common in files.

The wear leveling is used for flash memories. I looked up some information on the web about flash memories, and noticed that random writes to them are actually slower than to a mechanical disk. I do not know how it is for sequential writes, like log flushes.

A question is why to use a flash memory at all, when a battery-backed disk controller write-back cache does the same thing.

PZ: First Indeed writes are slower with Flash disks compared to Reads because Flash requires “erase” cycle to replace the data plus this erase can only happen in the blocks of significant size, say 64K.

But generally I think Log files is not something SSD can give the main benefit to – writes to log files are small and sequential which can be very well taken care of RAID Battery Backed Up cache and when sequentially written to the drives at the rate of 100MB/sec or more. This cache is DRAM based and so much faster than Flash.

SSD Drives would be big win for TableSpace files but these are much larger and so less used for the time being.

What would be really cool if someone would come up with “Persistent DRAM” memory device, which would offer some small amount, say 512M memory which is battery backed up just as RAID cache so it can handle system restarts. When putting log files on such device could give great logging performance because you no more have to go with SCSI interface both in hardware and software. I’m still surprised with databases being so much of commodity no one has something like it.

Q35: I believe this is a somewhat trivial question, but itâ€™s hard to find a good answer to it while googling.â€œWhat is the good way to split memory between INNODB buffers and OS?â€ Of course this depends on all kinds of things, but the only rule of thumb I seem to find is â€œINNODB should get around 70-80%â€ of the available RAM. I would guess that the number varied more, depending on the availability of RAM? On a dedicated server for Mysql you will still need some memory for the OS and the various stuff that will be running (monitoring, backup, cron, syslog, etc). But the ratio of RAM used for these non-mysql processes drops rapidly when you increase the amount of system memory. We just deployed a new Linux/Mysql server for a client, upgrading their main database (from 16GB, 2 two-year old HTâ€™ed XEONs to 32 GB, 4x quad-core AMD Opterons. Yes it helped:-). At the moment the server has around 21 GB of memory used for applications (of which Mysql is the main one with â€œinnodb_buffer_pool_size=20480Mâ€) while the bulk of the rest (9 GB) is used for the OS cache (yes I know, we are below 70-80% already) So, is the general recomendation also for boxes with larger amounts of RAM, to give INNODB ~70-80% of the total RAM, and leave the rest to the OS cache? Is there special options or tricks that should be implemented on larger servers?

HT: I think Linux often likes to use about 20 % of the available RAM as its file cache.

The ideal configuration in a dedicated server is: set innodb_flush_method=O_DIRECT in my.cnf, to prevent double buffering.
Configure as much memory as possible to InnoDB’s buffer pool. For this, you need to find a way to tell Linux to keep a small OS file cache. I am not sure if Linux currently has any switch for this.

PZ: Generally you can get much lower than 20% for Linux File Cache with good results. I just wrote the post on this this matter the other day – choosing innodb_buffer_pool_size

Q36: There have been several MySQL bugs opened about multi-core scalability (concurrent queries, autoincrement, concurrent inserts, instrumentation, etc.). Rather than treat each defect as an individual case, is there any performance testing going on to quantify what are the limits of different operations for different levels of concurrency, specifically around mutexes and contention? How does this fit into the development process?

HT: Contention and thread thrashing depends very much on the hardware, and also on the OS version. In our tests, removing one bottleneck moves the contention to another place. Thus, no general rules can be established.

I have assigned Inaam as the InnoDB scalability engineer. We do not have much resources for scalability research. Peter Zaitsev, Vadim Tkachenko, Yasufumi Kinoshita, and Mark Callaghan from outside Innobase have helped us greatly to spot scalability problems and find fixes for them. The work continues. I believe improving scalability is a continuous process for all major database brands.

Q37: We have a fairly large Innodb table (150GB) that is showing poor performance for full table scans. using O_DIRECT, we are seeing individual 16KB paged I/O going out one at a time, that are esentially reads misses (10ms each). The reads are not excactly sequential, there is a small gap between each. it is difficult to figure our what the problem is. We can OPTIMIZE the table, but how long will this take? Is there a facility we can run to show any table / index fragmentation, without having to run the OPTIMIZE?

HT: Since OPTIMIZE rebuilds the whole table, it may last way too long for your 150 GB table!

If you have inserted in an ascending order of the PRIMARY KEY, then InnoDB should have placed the rows in contiguous blocks of 64 pages == 1 MB. Then a full table scan would be fast. Maybe the table has become fragmented? Or maybe you have not inserted in the ascending order of the primary key?

I suggest that you test reorganizing that table on a separate computer, maybe a replication slave. Can you make the reorganization run in a tolerable time?

PZ: Indeed using master-master setup and switching roles is a good way to get large data manageable. MySQL Master Master Manager can help you to automate it a bit. I also like to keep tables small enough, “sharding” to many tables before they become so large it would take days to do operations on them. Idea for Heikki – it should be possible to implement online reorganize which locks and reorganizes small portions of table online.

Q38: Seems that adaptive hash indexes are not used for primary key lookups or not showed up in â€œshow innodb statusâ€.

select * from table where id=5
show innodb status:
Hash table size 10624987, used cells 1, node heap has 1 buffer(s) 0.00 hash searches/s, 15979.01 non-hash searches/s

HT: Strange, primary key equality searches should work through the adaptive hash index! What is your database like? Do you have a big enough buffer pool?

PZ: I guess it may be bug in the status or bug in implementation. I’ll check with Alexander or check myself to see what it could be.

0 0 votes

Article Rating

9 Comments

Oldest

Newest Most Voted

Jay Janssen

18 years ago

>Q29: Would doubling the amount of RAM from 8Gig to 16Gig help the write performance of Innodb on Solaris?
>HT: Yes, if your â€˜write working setâ€™ fits in 16 GB. I mean, if the database buffer pool can buffer many writes to the
> same page, then fewer disk writes are needed. But if your write working set is, say, 100 GB, then increasing the buffer > pool size does not help.

This is an interesting statement. Granted that an extra 8G of RAM for the buffer pool might not make that much difference with a 100G write working set, but it has to be somewhat better, doesn’t it?

Another question that I have is regarding the innodb_log_file_size in relationship to how many ‘dirty’ writes are allowed. Certainly we can’t have more dirty pages than buffer pool, but that the trx logs also will limit how many dirty pages that are possible. Is there any good way to determine this limit? Is there any way to know when we’re hitting this limit?

Author

Peter Zaitsev

18 years ago

Jay,

As I clarified that “somewhat” may be in very wide range depending on your workload. Typically you see improvement.

Regarding log files you can check how long “full cycle” takes – how long does it take Innodb to make full round writing the log files, this is the maximum time dirty page can be kept in the buffer pool without flushing.

Dormando

18 years ago

http://www.google.com/search?hl=en&q=gigabyte+iram&btnG=Google+Search

I’m not sure if someone’s done performance testing with one of these yet (and there’re many other similar models, mostly much more expensive). I plan on doing so as soon as projects permit. I’d like to put InnoDB logfiles and the latest binlog on these memory backed devices, removing those writes/reads off the disks entirely.

Bill

18 years ago

Nice. Thank you to both Peter and Heikki for doing this. I understand the answer to Q36, but in the absence of real hard numbers, fuzzy approximate ones can be very helpful. It would be great if there was just this magical chart that you could look at that told you if your normal load was X reads per second, Y writes/s, Z updates/s with a table of size Q Gb on OS R then these are the bottlenecks you will run into. I understand that it would take forever ( and thus some serious $$) to create and it isn’t your highest priority, but that would be awesome. You could plan your db needs based upon reasonable performance approximations, and purchase hardware accordingly. It would be a good marketing and/or consulting tool at least.

Author

Peter Zaitsev

18 years ago

Bill,

In our performance monitoring patches we now implemented per query working set estimation. It would not be hard to implement global working set estimation as well. Imagine table space is 1000000 pages we allocate bitmap which holds 2 bits per page telling us if page was written to or read during certain timespan, say 1 minute. Now we can pretty easily set bits in this bitmap and dump it to disk (it is just 250K/minute) and when use some tools to analyze how many distinct pages are being written/read per given time interval and so estimate how different query ratios will impact it. One can also estimate with fixed memory hash similary as we do for queries.

Martin

18 years ago

I just want to thank Heikki for answering! Big thanks to Peter too!

mike

18 years ago

Hi,

With regard to Q34 about SSD. There are vendors that have onboard RAM cache incorporated into their SSD.
The first vendor of SSD we looked at didn’t have this feature and when we benchmarked it, we found it was
indeed slower than mechanical disk. Would be interesting to hear comments about using SSD from a vendor
like BitMicro which has onboard RAM cache. We are planning on benchmarking this.

So I wouldn’t offload my log files to SSD unless the SSD incorporated an onboard RAM cache. Otherwise, as
Peter and Heiki say,there would be no benefit.

Note: A lot of the newer boxes SUN makes have internal SAS drives. If you already have external RAID arrays
with BBU write cache, then I don’t think it makes sense to offload the log files to these internal drives. If you
do, then you would be sacrificing a lot of performance. If you don’t have BBU write cache in an external RAID
array, then it might make sense to offload the log files to the internal drives.

Mattias J

18 years ago

Would anyone happen to know, to what degree the different isolation levels in InnoDB affect CPU usage?
That is, how much CPU will threads waiting on locks use…?