November 20, 2014

Funniest bug ever

Recently my attention was drawn to this bug, which is a nightmare for any consultant.

Working with production systems, we assume reads are reads, and if we’re just reading we can’t break anything. OK, maybe we can crash the server with some SELECT query that runs into a bug, but we can’t cause data loss.

This case teaches us things can be different – reads can in fact cause certain writes (updates) internally, and those writes add risk, as this bug shows.

This is why transparency is important – to understand how safe something is, it is not enough to know what it does logically; you also need to know what really happens inside and so what can go wrong.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Pedro Melo says:

    Sorry Peter,

    that is not the funniest bug ever…

    This is: http://code.google.com/p/blackgold/issues/detail?id=3

    Best regards,

  2. Jonas says:

    Hi,

    We inside the cluster team actually consider it one of the worst bugs ever…
    Not exactly our proudest moment :-(

    But I agree, it’s so bad that it’s actually funny…

    /Jonas

  3. At least it looks like it doesn’t represent the typical use case of what most people will be doing – but it is very serious indeed. I’ve always found this bug the funniest:

    http://bugs.mysql.com/bug.php?id=2

  4. Reminds me of the ‘noatime’ option on unix file systems.
    When I first learned that by reading any file or file property I commit a write, I was utterly surprised.

  5. Perhaps we should investigate the on-disk format of NDB so we can start providing data recovery services for it, too.

  6. Matthew Montgomery says:

    @baron, data recovery services for NDB disk data are not relevant.
    For this bug it is simply not an option: NDB just DROPped the tables from the cluster, so you’d have to restore from backup.
    If disk data files do get corrupted on an individual node, you simply restart that node with the --initial option and those files are restored from the peer in the node group.

  7. Matthew Montgomery says:

    Correction… “simply restart that node with the --initial option [after deleting the on-disk data and log files]” (--initial will not remove on-disk data files).

  8. Matthew, there is probably no system in existence (that uses on-disk storage) for which data recovery from on-disk files is not relevant. The point of our data recovery tools and services is to recover data that has been dropped, deleted, corrupted, etc. when there is no backup. If it’s been dropped from the cluster, are the 1s and 0s on disk anywhere? If yes, then that’s exactly the type of scenario I’m thinking of. NDB can’t find the data anymore, but maybe something else can. And the customer might call us up, and we might write tools on the spot to do the recovery; that’s how our other tools got started ;-)

    What if there’s a bug in NDB such that the disk data files get corrupt on every node and there is no peer in the node group with a good copy? If it hasn’t happened yet, it may someday, who knows.

    Disclaimer: I have not investigated the on-disk format of NDB at all.

  9. peter says:

    Jonas, right. I meant “funny” in the sense that it is really kind of tragic.

    Though it is very nice to see you got the fix out relatively quickly, and honestly publishing such a bug also does you credit.

  10. peter says:

    Baron, Matthew,

    Right. If everyone had a good backup (with point-in-time recovery), we would not need any recovery tools. In practice, however, backups are sometimes found to be broken and you have to recover the data. Our experience shows no one is immune – a number of companies you’d think would have a backup have contacted us for help (with InnoDB).

    With InnoDB it is easy – thanks to the page format it is possible to locate data even if the filesystem was totally ruined (like a RAID meltdown).

  11. o.u. says:

    Wow, Shlomi Noach @ 4, re: the atime .. a write on every read, even from cache – I’m kind of shocked.

  12. It’s not quite that bad. It’s only once per second. (There’s only a write if the atime has actually changed, which is only true once a second.)

  13. johan says:

    Ha, it was a funny bug indeed. Unfortunately, I found it on a customer site :(
    -j

  14. Log Buffer says:

    “Peter Zaitsev shares the funniest bug ever.”

    Log Buffer #134

  15. o.u. says:

    Thank you Baron – ok, not as crazy as it sounded then, though still a shock.

  16. Hi Baron,

    Though I haven’t benchmarked the difference between ‘atime’ and ‘noatime’ myself, please see what Linus Torvalds has to say about it:
    http://lkml.org/lkml/2007/8/4/98

    He claims more than 10% savings, though he wasn’t testing MySQL.
    Do you have any benchmarks comparing with/out “noatime”?

    Regards

  17. peter says:

    Shlomi,

    Linus mentions a “mail spool”, which tends to have a lot of tiny files, and this is where the overhead is significant. Unless you’re dealing with tens of thousands of tables in MySQL, you’re in a different situation. This also means any benchmarks you would like to do should be workload specific – in your particular case it is possible you will see a significant gain.

  18. No, I don’t, but anecdotally I can say that it matters if you have a lot of files. For example, suppose you have 100k tables, which is pretty common in certain types of apps. That’s at least 300k files if you’re using indexed MyISAM tables (also common for the same scenarios). Now suppose that you’re accessing them all randomly; you can do the math on how many times you’ll be doing an atime/diratime write. In many cases you won’t access a file more than once a second, so each access suffers the hit.

    My anecdotal evidence is that I haven’t seen significant performance changes from adding noatime,nodiratime to the mount options on “normal” servers with a few hundred tables. You can change the mount options at runtime so it’s pretty easy to see.

  19. @Peter, @Baron,

    Thanks for the information. It does sound a lot more reasonable in light of your explanation.
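
To put rough numbers on the atime discussion in the comments above, here is a minimal sketch in Python. It is a toy model, not a measurement: it assumes atime has one-second resolution and is written only when the value changes (as described in comment 12), and it takes the three files per table from comment 18. The function name `count_atime_writes` and the file names are hypothetical, and real file systems and mount options (such as relatime) behave differently.

```python
def count_atime_writes(accesses):
    """Count atime updates for a stream of (filename, timestamp) reads.

    Simplified model (an assumption, not real kernel behaviour):
    atime is stored with one-second resolution and is only written
    when the stored second differs from the second of the read.
    """
    last_atime = {}   # filename -> last recorded atime, in whole seconds
    writes = 0
    for name, ts in accesses:
        second = int(ts)
        if last_atime.get(name) != second:
            last_atime[name] = second
            writes += 1
    return writes


# One hot file read 1000 times within the same second.
hot = [("t1.MYD", 10.0 + i / 1000) for i in range(1000)]
hot_writes = count_atime_writes(hot)

# 100k tables with 3 files each, every file read once.
tables = 100_000
files_per_table = 3
cold = [(f"t{t}.{f}", 10.0) for t in range(tables) for f in range(files_per_table)]
cold_writes = count_atime_writes(cold)

print(hot_writes, cold_writes)
```

Under this model the hot file costs a single atime write no matter how often it is read within that second, while the many-files case pays one write per file touched, which matches the point that the overhead depends on the number of distinct files rather than the number of reads.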
