October 25, 2014

Falcon Storage Engine Design Review

Now as new MySQL Storage engine – Falcon is public I can write down my thought about its design, which I previously should have kept private as I partially got them while working for MySQL.

These thought base on my understanding, reading docs, speaking to Jim, Monty, Arjen and other people so I might miss something, something might be planned to be changed or not 100% correct but anyway you might find it interesting.

In many cases what I find good or bad would base of my MySQL use with existing applications – if you design new applications which are done specially for Falcon you might find those aspects positive.

[-] No Spinlocks Falcon does not use Spinlocks. It has its own lock implementation which does operation system wait if resource is unavailable. We’ll see where this leads us but I believe on multiple CPU systems you need some spinning done, at least for some types of locks where conflicts will be frequent. And now even laptops are getting multi cores so we can forget about spinlocks wasting CPU time on single CPU boxes without much of the gain. Spinlocks need to be wise though.

[+] Row Cache The fact Falcon has separate row cache is great as it means it can have much more data cached in the same cache size. You might need only single row from 8-16K page but page cache requires you to cache whole page, so needing 1000 of 100 bytes rows cached, which are all on different pages you need 100K of memory not 8-16M as you need with page cache. There are design techniques though to improve page cache efficiency in this respect, so in practice difference can be less.

[-] Not fixed cache size, but size range This is minor one. Unlike you would typically see Falcon uses minimum and maximum sizes for row cache instead of single size, this is complicated. If I have 8GB of RAM I can assume I want 4GB allocated to row cache and I do not really care about how these 4GB will be managed, to keep it the most efficient. Should I specify min size of 3GB or 3.8GB ? What is the difference ? Will purge run well too frequently if I keep them close. Manual does not say much on this matter at this point.

[-] Few configuration options This one is probably a bit contradicting to previous one but I put on my experimentator Hat here, not one of general user. I would like to see more configuration options to adjust database behavior to match my needs. I love this Zero administration thing and I would like database to manage things for me but it should first prove it can do it better than I do, and this can be only checked by comparing performance with your own optimized settings and one database engine would like to use itself. At this point I have not seen auto-configuration algorithms to be able to cover all kinds of workloads – they may deal well with something typical but not if you’re doing something unusual, which people do quite frequently with MySQL :)

[+] Compact Indexes Compact Indexes are great. Huge indexes is where large portion of Innodb performance problems come however. We should however see how “compact” they would be in real world cases and how much performance it will give. As “SHOW TABLE STATUS” is not showing index sized in the release we tested it is hard to see how small indexes really are.

[-] No “using index” Falcon always have to check the row data even if you retrieve only columns which are in the index. I think this is big gotcha as having index covered queries is great optimization for wide range of queries.

[+] Ordered data reads Unlike other storage engines which will read row data as they traverse the index Falcon can optimize it by reading index first (possibly many indexes) and only then reading row data in sorted order. This can help a lot if you traverse significant portion of the table (or data which is locally stored but in random order). In my opinion this however should have been implemented outside of MySQL Storage Engine level for MySQL – reading rows in their physical order can be applied to most storage engines, it certainly can be applied for MyISAM and Innodb. I remember there were plans to implement such handler interface but I’m not sure there they are now.

[-] No Clustering Index support Unlike Innodb Falcon does not cluster data by Primary Key or any other key. This is good for some applications bad for others. There is large amount of applications out where which are optimized for Innodb which does have clustering, this is why I have it with Minus sign. In my opinion Optional clustering is best, in fact this is what you can get in Innodb (by using its internal hidden primary key) but this functionality is not exported well as it requires you to get rid of primary keys etc. Some presentations say Falcon does not need clustering because of optimized read by the index. This is not the case. If you access small primary key ranges with Innodb you will need to perform just couple of page reads, with Falcon if rows were inserted in different times a lot of scattered reads will be needed. Think about typical use of Innodb clustering – users mailbox, clustered by (user_id,message_id).

[+] Row compression Falcon uses some nice fast row compression methods, for example using only as many bytes for integer as it requires, not storing column value if it is default etc. This is nice even though may show strange effects if you do not know about this feature, for example changing default value needs table rebuild even though for other storage engines it often could be done simply by changing meta data (it is not implemented in MySQL anyway though). Very interesting to see how this one will compare to transparent gzip page compression which is being implemented for Innodb right now.

[-] Tablespace per database You may notice right now each database in Falcon gets its own tablespace and its own set of logs. With typical for MySQL use of Databases just as directories it can be the problem – if transaction spawns multiple databases you need multiple log flushes on commit, plus ether you use XA to synchronize these which is expensive or it will be possible for your transaction changes to commit in one database and not in the other. The other issue you will see is having consistent snapshots being consistent for database only, so if you have started transaction accessed table in database A and a while after accessed table in database B, you will see records in B which were modified/inserted after you started transaction.

[-] Isolation Modes So far limited amount of isolation modes is supported, including no support for SELECT FOR UPDATE. This is also why Falcon has Optimistic locking concurrency which can give you problems if you have lock intensive workload so many update queries have to wait for row level locks. With Falcon at this point situation is rather funny, it will still wait for other transaction which updated the row to commit and then fail with “ERROR 1020 (HY000): Record has changed since last read in table ‘t'” error message. If you find Falcon or Innodb behavior better still prepare for this to be area where things are different and your application may need to be changed. This is also area where detailed documentation is missing so far.

[-] No protection from partial page writes This means if single page write was not atomic, so only part of page was changed while other part remains old your database may end up in non-recoverable stage. This is why Innodb has “innodb double write buffer”. Jim does not believe this is the problem but I remember having problem not the once and twice before Innodb Double Write Buffer was implemented and seeing some pages recovered from it after it was implemented. Speaking with Oracle and PostgreSQL developers it looks like they all acknowledge this problem and have techniques to deal with it, so I’m a bit surprised why Jim does not believe in it.

[+/-] Only committed changes go to the disk Innodb modifies data in the database when you change it, Falcon only when it is committed. This has some good sides such as no problem with very long rollbacks on recovery but also means you should have enough memory to hold all transaction changes. I agree most transactions are small and should be fine but if you have long running transactions, for example batch jobs you may end up in trouble or be forced to rewrite them to commit often.

[+] Blob handling optimizations Jim loves BLOBs as his own child, so Falcon is optimized in Blob handling for example having direct blob writes to the database and probably some others. This would be very interesting to see how these optimizations will show themselves in practice.

The other interesting point which I have not found much information about is regarding handling fragmentation – what happens on row updates, are they stored in the same position or in another location ? Does row ever split into multiple parts like with MyISAM or always stored as single piece ? These are going to be pretty important for IO bound workloads.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Ann Harrison says:

    One small correction and a bit more information.

    “Unlike other storage engines which will read row data as they traverse the index Falcon can optimize it by reading index first (possibly many indexes) and only then reading row data in sorted order.” This should say “only then reading row data in storage order.” The optimization is reading in storage order.

    There are two levels of indirection in locating recods – the record number is actually
    the number of a record locator page and the offset on that page of the entry that has
    the data page number and index offset for the record. The index offset is a location on
    the data page containing the record’s length and offset. So when a record is updated,
    it just needs to be on a data page that’s covered by the record locator page. Normally,
    it will go back on the same page it had been on – easier that way – but if there’s no
    room, it can go on another page.

    Records are fragmented when they are larger than a page. Falcon fills overflow pages
    with the end of the record until the remaining piece fits on a data page. The same
    algorithm is used for blobs.

  2. peter says:

    Thank you Ann,

    I meant “storage order” by “sorted order” implying “physically sorted order”.

    And thank you for clarification of how row storage is happening – so you store it in place when it is possible (not always in new location like MaxDB or Solid) this is good.

  3. Jim Starkey says:

    Many interesting comments. Let me start by giving some short answers which I’m sure will lead to longer and even more interesting debates.

    No Spinlocks

    I’ve never been convinced that spinlocks ever make things better. I tend to lump spinlocks with the folklore that mutexes are cheaper than read/write locks. If someone would like to modify SyncObject to spin, I’d be interested in the results. If it’s consistently faster, I’ll change my mind.

    Row Cache

    Yes, a row cache is more memory efficient than a page cache, but that’s only part of the story. It’s also a great deal cheaper to reference a record in memory than in the page cache. The record cache also holds state information, maintains version pointers, and holds ancillary structures to speed up field access – all important but not worth storing on disk.

    No fixed cache size, but size range

    The trick in managing the row cache is maintaining some semblance of LRU for pruning. Maintaining an actual LRU would require a lock, which is unthinkable from a performance perspective. So Falcon keeps a cheap (but, alas, inaccurate) tally of record space by generation. When the upper limit is reached, a regularly scheduled scavenge thread prunes out old generations until the lower limit is reached. After thinking about it a bit, making the lower limit some fixed fraction of the upper limit makes sense. One fewer configuration parameter. Excellent!

    Few configuration options

    There are two main reasons for configuration tuning. One is allocation of a fixed asset, such as memory. There configuration management makes sense because the user probably knows how much memory can be allocated to Falcon. The other is that the designer didn’t know what he was doing. I actively avoid designs that require tuning to work well – they lead to a “blame the victim” attitude.

    Ordered data reads

    The down side of two phase index retrievals is one you identified in your performance study – bad performance in the “limit” case, particularly on indexes with large key sizes. But we’ll address that soon.

    No Clustering Index Support

    Clustered indexes are problematic in MVCC systems and have negative value in systems with a record cache. If the record is in cache, what’s the point of fluffing up the index with records? A denser index is a faster index. Clustered indexes are a vestige from the small, expensive memory past.

    No “using index”.

    First, the record cache greatly reduces the cost of looking up records in many cases. Second, in an MVCC system, not all index entries correspond to records visible by the current transaction; to use the index directly the index must also contain enough transactional information to identify records that are appropriate. We think that the cost of maintaining that information and the increase in size of indexes overwhelms the additional cost of checking records in those cases where the query could have been resolved from the index.

    Tablespace per database.

    This is purely an artifact of the alpha release. It will change before beta.

    Isolation Modes

    This is also on our list of things to address before beta. However, keep in mind that in MVCC imposes different behaviors with regard to update conflicts. Falcon was designed to support the possibility that I could be convinced that record locking has significant merit in real applications, which hasn’t happened yet.

    It’s worth noting that InnoDB allows a transaction to overwrite changes that are not returned from a normal select. The non-standard “FOR UPDATE” clause does cause a select to read the most recent version, but breaks the rules for repeatable read. The Falcon rule is simple – if you can’t see it, you can’t change it.

    No protection from partial page writes

    Interbase used to checksum pages, but calculating the checksum took about 15% of the total CPU time, so it was dropped. Maybe machines are fast enough this isn’t an issue anymore, but it is a feature guaranteed to lose benchmarks against systems that don’t checksum pages. In any case, I did leave a short in the page header to support page checksums if we ever decide we need it. Incidentally, lest anyone is thinking about the hack of writing a signature word on the first and last bytes of a page, that doesn’t even come close to working. With checksums, the InnoDB double page write might be a good thing to add, assuming that the performance is acceptable.

    Only committed changes go to the disk

    There’s no reason that uncommitted records can’t be written to the serial log before commit. All that is necessary is the bookkeeping to read the record from the log (if necessary) before it gets flushed to the page cache. This is another of the pre-beta tasks.

    Blob handling optimizations

    Falcon’s concept of blobs centers around jpegs, pdfs, and Word documents – things from a half a megabyte to maybe 8 to 10 metabytes. Falcon automatically maps “tiny blobs” (an oxymoron!) to varchars, but real blobs are stored in a separate storage segment. The huge different from other storage engines that Falcon doesn’t materialize blobs unless the blob is referenced and doesn’t rewrite blobs during an update unless the blob itself has changed. Blobs are also written asynchronously pre-commit to keep lots of big ugly stuff out of the serial log.

    Other Stuff

    Rows on disk can be fragmented if they’re too big to fit on a page or change size and no longer fit. We could move a record to a page with more space, but don’t at the moment. The on-disk structure is designed around the page cache manager, so the cost of a fragmented record probably isn’t worth worrying about. For real world applications, 95+% of record access is from the record cache, not the page cache. Disk access time is important, but we expect that real applications will tend to keep most interesting records in cache.

  4. peter says:

    Jim,

    Thank you very much for your comments.

    Spinlocks: I see you use Windows terminology. Would be interesting to check what you use on Windows. Surely spinlocks is something which can be added simple enough.

    Row Cache: Yes I see it also improves row lookup speed. Innodb uses alternative approach with adaptive hash indexes.

    Two sizes: Right if you say max-size*0.9 is good value for all cases why to have two values, if not it would be good to know when to set it to which value.

    Configuration: Memory configuration is actually easy one to solve automatically, there is however bunch of things which can be hardware related (CPU and disk concurrency), some are environment related – log flushing on transaction commit, size of log files etc. Regarding design – this is something which can’t really be changed in many cases. Performance problems tend to appear then application is in production

    Ordered data reads: As you see I have not even mentioned that Limit thing as I assume that is going to be fixed.

    No Clustered Index: I think you relay on row cache too much. If working set fits in memory (row cache) there is no problem but if it does not this is there problems start.

    Using Index: Again you look at the case then data fits in memory, in this case there is no problem the main benefit is actually dramatically reducing number of IOs needed for range scan for IO bound queries. Yes I see there is tradeoff – if you have MVCC and you keep version information in indexes as well they would be more expensive to maintain and take more space.

    Tablespace per database: Great to know it will be changed. I would emphasize it in current Limits section though.

    Isolation Modes: Innodb, Oracle, PostgreSQL also support MVCC but frequently offer it in different way. The problem I see with current behavior is – it can backfire as load growths and so concurrency increases – on low concurrency even MyISAM with table level locks does just fine. SELECT FOR UPDATE might not be standard but quite commonly used and is quite helpful in many cases. Again you may call it wrong design but it may be complicated to change a way people get use to doing things, they may just use different solution instead. Anyway good to hear you plan some changes in this aspect for Beta.

    Protection from partial page writes: Checksum can be done optional. Oracle and Innodb have a global option but could be done on Table, same way as MyISAM does it. Checksums are good for many reasons, ie tracing bad memory or hardware melt down before corruption spread all around database. Some other techniques however are needed to prevent from partial page writes breaking database.

    No uncommitted changes to the disk: Good to know this will be supported. If you place it to the log though you will need some extra logic to locate the data if transaction reads data again and version is not in memory any more, or you may throw an error.

    Blob Handling: Just regarding your comment about blob materialization – Innodb also will only read the blob if it is accessed. It also should not modify it if it was not changed by update statement.

    Thanks for clarification about row fragmentation – so row is not moved to new page if it does not fit to the old one any more but gets fragmented now right ?

    Regarding real world applications – very significant part of my consulting practice comes from the cases which corresponds to IO bound workloads. With modern CPUs if you data is in memory you can do a lot of stupid things – millions of rows can be crunched per second, but then your data stops fitting in memory you can get in trouble pretty quickly.

    And yes one more thing: Falcon Stats are missing, so there is no insight so far on internal processes which are happening.

  5. Apachez says:

    I like the idea of fewer configuration options.

    As a databasadmin I basically only care about how much memory in a overloaded sitation which the database might use. Given that the internal memory caches is optimized or has code that optimizes itself during runtime.

    As a compare is the memory options in regular mysql which basically is just keybuff + sortbuf, readbuf, readrndbuf and joinbuf. The problem with the last four is that changing them has a huge impact on mysql performance and the default values is just not good enough. So if mysql would throw away the sort/read/readrnd/join settings and exchange them into a selfconfiguration type of values just keeping the keybuffer (or even better exchange that into a “mysqlmem” setting) it would be just great.

    By the way, is falcon now the new official transactional db storage in mysql since innodb got bought my oracle?

    And what happend to soliddb, wasnt that suppose to take innodb’s place in mysql?

  6. peter says:

    Apachez,

    What you’re saying is “one option is best, if database can manage it optimally” I totally agree with you and I would enjoy having zero options and let database to read my mind what is really important for this application (as it is tradeof in many cases).

    My main point is it is not possible or hard to get it right so have automatically detected defaults but still let users to adjust behavior if algorithm fails for some reason.

    Compare it to MySQL Optimizer for example. It is designed to provide optimal query plan right ? But we all know it does not always do it and no optimizer gets it right in 100% cases – this is what hints are for.

    Removing hints before optimizer is right in 100% cases would be silly, I feel same about configuration options.

  7. Apachez says:

    Yes but at the same time we can see that the current configurations of mysql is not good either. For example take the case of readbuffer which at first sight might sound good to set to 32M or higher. Unless if you use subqueries because then mysql will puke at you and start going slower than an excel97 :P

    Would be interresting if some auto settings could be developed within mysql so we could get the best from each world… in other words allow both “read_buffer = 32M” aswell as “read_buffer = auto”.

    Also the examples which is included mysql (my-huge.cnf etc) seems a bit outdated regarding how the memory is being used in today versions of mysql 5.x.

  8. peter says:

    Apachez,

    Right now you simply should not run MySQL with default settings – performance will be really poor.

    Some auto configuration functions would be good but take out parameters would not be.

  9. James Day says:

    Apachez, read buffer or sort buffer? The sort buffer was recently changed so it would be reused in subqueries instead of being reallocated. If you’re seeing the same problem with the read buffer please provide a bug with a test case like the one in http://bugs.mysql.com/bug.php?id=21727 and it will probably be fixed quickly. Please mention that James asked you to report it.

  10. Chris Stephens says:

    I think that the “No Clustering Index support” would be the biggest negative for me, or anyone that has had to deal with very large databases. While Jim makes the point that memory is cheap, just as the price of memory has continued to fall the amount of data that is being collected and stored has continued to rise; which cancels out the cheaper memory solution for companies that are dealing with large amounts of data.

  11. Arioch says:

    > I agree most transactions are small and should be fine but if you have long running transactions, for example batch jobs you may end up in trouble or be forced to rewrite them to commit often.

    Falcon is obviously derived from previous Jim’s work on Groton database (later it was known as Interbase, now as Firebird SQL server). Any one familiar to those db engines will recognise many features, described above, as well-known strong and poor points of Firebird SQL server.

    For example, about transacttions above, preferred way to use Interbase/Firebird is a bi-transactional mode: one transaction is never-closing read-only read-commited one, another connection used to commit all the changes to database in a serie of short-lived not-read-only serializable transactions. Guess something alike will work fine for most application over Falcon engine.

    BTW, one known drawback about Firebird/Interbase is its beeing ACID on row-level but being not on row-level. For example INSERT INTO TABLE SELECT * FROM TABLE would not duplicate a table, buit instead continue cloning itself until all possible disc space for database file would be claimed. I wonder if this is the same for Falcon SQL engine ? this is not obvious rooks on you step-way :-) If found true, i think anyone experimenting with Falcon engine is better to keep an eye on Firebird/Interbase historic users experience, until with development going on codebases wold diverge that far that features and drawbacks of engines would became unrelated (will it? both engines are based partially on Jim’s idea of most practical database, hence with different code it perhaps would tend to achieve the same goals)

  12. Arioch says:

    typo: beeing ACID on **column-level** but being not on row-level

  13. jonas-AT-mysql.com says:

    ADVERTISING:
    Ordered data reads

    Is implemented in mysql-server level, it’s internally referred to as BKA (batched key access)
    It’s implemented for cluster (to reduce latency, instead of ordering reads)

    I’m not sure exactly which mysql version this will appear in.
    References:
    http://forge.mysql.com/worklog/task.php?id=2771

  14. Pgsql says:

    I’m laughing so hard reading this. Falcon should be the best thing since sliced bread and at least from this discussion I can see that it will be (like other MySQL stuff) just a hack, not a viable database engine. Maybe things will change, maybe not. I’m even more glad I never took on using MySQL but Pgsql instead. Zero problems, zero crashes, zero database corruptions in 10 years with hundreds of gigabytes of data.

    And no, I don’t want to start a flame war, but this really doesn’t look like real database development to me. But this discussion is old, maybe I should read more what’s been done in the past 1,5 years. I won’t hold my breath, though.

Speak Your Mind

*