Operating Systems do not provide a good IO interface for Database Servers
Peter Zaitsev
Thinking more about the problems I wrote about yesterday, I asked myself why such ugly workarounds, guesses or manual configuration are needed. The answer seems to be that operating systems just do not provide an IO interface which is good enough. The big missing piece is priority. There are process and thread priorities in most operating systems, but there is no priority you can set for a given IO request. In fact, in most cases even thread priority is not taken into account while executing an IO request. Last time I checked there was an IO scheduler with these functions in the works for the Linux kernel, but I have not seen anything production ready.
So what would I like to see? It should be possible to assign a priority to each IO request, from at least 3 hard classes: “RealTime”, “Normal” and “Idle”. More fine-grained control is good but not necessary. What could databases do if such an interface existed?
- RealTime – This priority could be used for showstopper requests which stall all transactions until they are executed. The most common are synchronous log writes on transaction commit: typically no other transaction can commit until this log write is completed. It could also be used to flush dirty buffers if the whole buffer pool is filled with dirty pages and we can’t read anything into it until we get rid of some of them. Also, obviously, it would be helpful for prioritising transactions inside the database system itself.
- Normal – As the name implies, normal IO should be done with this priority, such as reading data and index pages. You want to get this data sooner rather than later, but at least the database server is not stalled while you’re waiting.
- Idle – This priority is great for flushing dirty buffers, checkpointing and other activity which you want to happen in the background when there are enough resources. The IO scheduler should execute these only if there is bandwidth available, so we can have as many idle requests in the queue as we like and it should not slow anything down. It could also be used for read-ahead requests, which are often speculative and which we want to happen only if there is free bandwidth. Do not confuse these with read buffering, where we know we’re going to need certain data and so use a large buffer to read it; those reads can be done with higher priority.
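To make the three classes concrete, here is a minimal user-space sketch of the dispatch rule I have in mind. It is a toy model and not any real kernel interface: the names `IOClass`, `ToyIOScheduler`, `submit` and `next_request` are made up for illustration, and the requests are just strings. The only thing it shows is the "hard class" behaviour, where a lower class is served only when every higher class is empty.

```python
from collections import deque
from enum import IntEnum


class IOClass(IntEnum):
    """Hypothetical hard priority classes; a lower value is served first."""
    REALTIME = 0
    NORMAL = 1
    IDLE = 2


class ToyIOScheduler:
    """Strict-class scheduler: a class is served only when all higher classes are empty."""

    def __init__(self):
        # One FIFO queue per class keeps submission order within a class.
        self._queues = {cls: deque() for cls in IOClass}

    def submit(self, request, io_class=IOClass.NORMAL):
        self._queues[io_class].append(request)

    def next_request(self):
        """Return the next request to dispatch, or None if nothing is queued."""
        for cls in IOClass:  # iterates REALTIME, NORMAL, IDLE in that order
            if self._queues[cls]:
                return self._queues[cls].popleft()
        return None


sched = ToyIOScheduler()
sched.submit("flush dirty page 17", IOClass.IDLE)
sched.submit("read index page 42", IOClass.NORMAL)
sched.submit("sync log write on commit", IOClass.REALTIME)

order = []
while (req := sched.next_request()) is not None:
    order.append(req)
print(order)
```

Even though the log write was submitted last, it is dispatched first, and the dirty page flush waits until nothing else is pending. A real scheduler would also have to deal with request merging, starvation and device queue depth, which this sketch ignores.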
The other concept which would be quite helpful (for asynchronous IO) is priority escalation. This is needed because in certain cases the load will be high enough that idle bandwidth is not enough for flushing dirty buffers or keeping up with checkpointing. In this case you would like to raise the priority of these activities, but as a lot of IO requests have already been submitted, it would be good to be able to increase the priority of those as well. Alternatively you could of course simply cancel the old requests and resubmit them with new priorities, but that is ugly.
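Escalation could look roughly like the following sketch, again as a user-space toy with invented names (`EscalatingQueue`, `submit`, `escalate`, `dispatch`). The assumption is that `submit` hands back a request id which can later be used to raise the class of a request that is still queued, without cancelling and resubmitting it.

```python
import itertools
from collections import OrderedDict

# Hypothetical class numbers; a lower value is served first.
REALTIME, NORMAL, IDLE = 0, 1, 2


class EscalatingQueue:
    """Toy request queue where a pending request can be moved to a higher class."""

    def __init__(self):
        self._ids = itertools.count()
        # Per class: request id -> payload, kept in FIFO order.
        self._pending = {c: OrderedDict() for c in (REALTIME, NORMAL, IDLE)}
        self._class_of = {}

    def submit(self, payload, io_class):
        req_id = next(self._ids)
        self._pending[io_class][req_id] = payload
        self._class_of[req_id] = io_class
        return req_id

    def escalate(self, req_id, new_class):
        """Raise a queued request's class.

        Returns False if the request is no longer queued or if new_class
        is not actually higher than its current class.
        """
        old_class = self._class_of.get(req_id)
        if old_class is None or new_class >= old_class:
            return False
        payload = self._pending[old_class].pop(req_id)
        self._pending[new_class][req_id] = payload  # joins the back of the new class
        self._class_of[req_id] = new_class
        return True

    def dispatch(self):
        """Return the payload of the next request, or None if the queue is empty."""
        for c in (REALTIME, NORMAL, IDLE):
            if self._pending[c]:
                req_id, payload = self._pending[c].popitem(last=False)
                del self._class_of[req_id]
                return payload
        return None


q = EscalatingQueue()
flush_ids = [q.submit(f"flush page {n}", IDLE) for n in (1, 2)]
q.submit("read index page 42", NORMAL)

# Checkpointing is falling behind, so raise the flushes that are already queued.
for rid in flush_ids:
    q.escalate(rid, REALTIME)

dispatched = [q.dispatch() for _ in range(3)]
print(dispatched)
```

Here the two flushes were submitted as idle work but end up ahead of the normal read once they are escalated. In this sketch an escalated request joins the back of its new class; whether it should keep its original submission time instead is a design choice I leave open.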
I would be very interested to hear if there is any development going on for something like this in modern operating systems.
In fact these are not new ideas. I’ve been working with Philippe Bonnet’s students from the “Badger” project at DIKU. We tested some prototype patches for the Linux kernel and MySQL, but we have not got anything ready for production up to this point.