Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

PREVIOUS POST
NEXT POST

Most of the time, our blog posts explain what’s great about the MongoDB improvements we’ve already shipped in TokuMX. Sometimes, though, it’s fun to talk about what’s coming soon, especially when user feedback would really help get the feature right. In my next series of blog posts, I get to geek out and talk about a feature we have been developing that I personally find really exciting: Ark.

What is Ark?

Ark is an implementation of a consensus algorithm (also known as elections) similar to Paxos and Raft that we are working on to handle replica set elections and failovers in TokuMX. It has many similarities to Raft, but also has some big differences.

Here is a tech report that explains the algorithm and provides proofs of correctness. Please download it and tell us what you think.

Why design Ark?

In short: to fix known problems with the election protocol used by TokuMX and MongoDB.

As many know, MongoDB’s existing election protocol has issues. Kyle Kingsbury, known as “aphyr” on twitter, showed some basic behavioral problems when analyzing MongoDB as part of his Jepsen series of blog posts analyzing the impact of network partitions on distributed databases. A conclusion he arrived at was “MongoDB is neither AP nor CP”, when considered in the context of the CAP theorem.

Because TokuMX inherited MongoDB’s election protocol, we inherited these problems, and we want to fix them.

Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.

Additionally, we want to fix other known user experience issues that we know of, such as SERVER-9848 and SERVER-8084, along with issues we discovered while analyzing the code, such as SERVER-14382. The high level goal is to improve failover behavior. The tech report linked above details the issues we see and our approach for fixing them.

Is Ark implemented?

Currently, yes. We have an implementation on github in the election3-sandbox branch of TokuMX.

But we have yet to ship it.

That paper is heavy reading! Is there a simple explanation of what has been done?

Not yet :). Over the next few series of posts, I will be explaining in layman’s terms what the algorithm is, in smaller, more digestible pieces. So stay tuned…

Are there any shortcomings to Ark?

Nothing is perfect :). We hope the community can give us feedback on what we’ve done. The biggest shortcoming of Ark is that it currently does not address replica set reconfigurations. The Raft paper does. We have not addressed reconfigurations yet because we are looking to improve this area in steps. In short, we are busy and need to manage this project in incrementally. We want to get this important first step of fixing majority write concern and other important issues done, before addressing what we hope is a more rare scenario of handling configuration changes during network partitions.

What can the community do?

In short, give us feedback, any feedback. If there is feedback from reading the tech report, we’d love to hear it. If there is feedback from reading the code, we’d love to hear it. If anyone would like to run the code, please let us know and we will happily provide a not-for-production binary containing the new algorithm.

So, in short, tell us anything.

PREVIOUS POST
NEXT POST

Share this post

Comments (7)

  • Mark Callaghan Reply

    Wow. Thanks for documenting a few serious flaws. And more thanks for fixing them. Do you include Jepsen tests in your QA?

    July 19, 2014 at 9:06 am
  • Andy Reply

    Why is Ark limited to MongoDB? Will it be used for TokuDB on MySQL?

    July 23, 2014 at 1:39 pm
    • Zardosht Kasheff Reply

      Ark is designed for MongoDB because it is built on top of MongoDB’s existing replication and HA algorithms.

      July 23, 2014 at 1:54 pm
  • Yaniv Shemesh Reply

    Does the ARK resolve all the consistencies problems mentioned in aphyr recent blog:
    https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
    Does it need additional configuration?

    May 15, 2015 at 2:10 pm
    • Yaniv Shemesh Reply

      I got an answer:
      TokuMX only fixes the data loss problems in the mongo replication design (by tying acknowledgements to election terms a la raft). TokuMX does NOT fix the stale/dirty reads problem!

      Mongo clients just don’t expect to verify that the server they’re talking to is actually the primary; without changing clients or adding a lot of server-side logic to proxy for clients, I don’t think this is fixable in the replication design mongo uses and TokuMX inherited.

      May 28, 2015 at 9:50 am
  • Losing it? | InsideMySQL Reply

    […] read the excellent series of posts from Tokutek on replica set failover including the overview and posts one, two, three and four. There have been problems that haven’t been widely known. […]

    October 12, 2015 at 7:41 am

Leave a Reply