EmergencyEMERGENCY? Get 24/7 Help Now!

Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

 | July 18, 2014 |  Posted In: Tokutek, TokuView

PREVIOUS POST
NEXT POST

Most of the time, our blog posts explain what’s great about the MongoDB improvements we’ve already shipped in TokuMX. Sometimes, though, it’s fun to talk about what’s coming soon, especially when user feedback would really help get the feature right. In my next series of blog posts, I get to geek out and talk about a feature we have been developing that I personally find really exciting: Ark.

What is Ark?

Ark is an implementation of a consensus algorithm (also known as elections) similar to Paxos and Raft that we are working on to handle replica set elections and failovers in TokuMX. It has many similarities to Raft, but also has some big differences.

Here is a tech report that explains the algorithm and provides proofs of correctness. Please download it and tell us what you think.

Why design Ark?

In short: to fix known problems with the election protocol used by TokuMX and MongoDB.

As many know, MongoDB’s existing election protocol has issues. Kyle Kingsbury, known as “aphyr” on twitter, showed some basic behavioral problems when analyzing MongoDB as part of his Jepsen series of blog posts analyzing the impact of network partitions on distributed databases. A conclusion he arrived at was “MongoDB is neither AP nor CP”, when considered in the context of the CAP theorem.

Because TokuMX inherited MongoDB’s election protocol, we inherited these problems, and we want to fix them.

Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.

Additionally, we want to fix other known user experience issues that we know of, such as SERVER-9848 and SERVER-8084, along with issues we discovered while analyzing the code, such as SERVER-14382. The high level goal is to improve failover behavior. The tech report linked above details the issues we see and our approach for fixing them.

Is Ark implemented?

Currently, yes. We have an implementation on github in the election3-sandbox branch of TokuMX.

But we have yet to ship it.

That paper is heavy reading! Is there a simple explanation of what has been done?

Not yet :). Over the next few series of posts, I will be explaining in layman’s terms what the algorithm is, in smaller, more digestible pieces. So stay tuned…

Are there any shortcomings to Ark?

Nothing is perfect :). We hope the community can give us feedback on what we’ve done. The biggest shortcoming of Ark is that it currently does not address replica set reconfigurations. The Raft paper does. We have not addressed reconfigurations yet because we are looking to improve this area in steps. In short, we are busy and need to manage this project in incrementally. We want to get this important first step of fixing majority write concern and other important issues done, before addressing what we hope is a more rare scenario of handling configuration changes during network partitions.

What can the community do?

In short, give us feedback, any feedback. If there is feedback from reading the tech report, we’d love to hear it. If there is feedback from reading the code, we’d love to hear it. If anyone would like to run the code, please let us know and we will happily provide a not-for-production binary containing the new algorithm.

So, in short, tell us anything.

PREVIOUS POST
NEXT POST

7 Comments

    • The difficulty with running Jepsen is that we’ve struggled using it to hit these bugs in TokuMX 1.5. Luckily, TokuMX 1.5 has some changes over basic MongoDB that makes this bug harder to hit. As I’ll detail in future posts, basic MongoDB currently uses a timestamp as the most significant bits to their GTID. As I mention in http://www.tokutek.com/2014/03/comparing-a-tokumx-and-mongodb-oplog-entry/, we use two sequence numbers as our GTID. I did not say so in the post, but the reason we did this is we anticipated fixing elections with something like Ark.

      I think the reason we could not get Jepsen to hit this bug with TokuMX is that our GTID made it much harder to hit. We used carefully crafted unit tests for our QA so far. They can be seen in the election3-sandbox branch I link to in the post. A couple of the main tests we ran are those I created in https://groups.google.com/forum/#!searchin/mongodb-dev/transient/mongodb-dev/-mH6BOYyzeI/zYJzFuCZuesJ and https://groups.google.com/forum/#!searchin/mongodb-dev/SERVER-9848/mongodb-dev/WA–aofOjQI/cF2OBqorZxkJ

    • Ark is designed for MongoDB because it is built on top of MongoDB’s existing replication and HA algorithms.

  • Does the ARK resolve all the consistencies problems mentioned in aphyr recent blog:
    https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
    Does it need additional configuration?

    • I got an answer:
      TokuMX only fixes the data loss problems in the mongo replication design (by tying acknowledgements to election terms a la raft). TokuMX does NOT fix the stale/dirty reads problem!

      Mongo clients just don’t expect to verify that the server they’re talking to is actually the primary; without changing clients or adding a lot of server-side logic to proxy for clients, I don’t think this is fixable in the replication design mongo uses and TokuMX inherited.

Leave a Reply

 
 

Percona’s widely read Percona Data Performance blog highlights our expertise in enterprise-class software, support, consulting and managed services solutions for both MySQL® and MongoDB® across traditional and cloud-based platforms. The decades of experience represented by our consultants is found daily in numerous and relevant blog posts.

Besides specific database help, the blog also provides notices on upcoming events and webinars.
Want to get weekly updates listing the latest blog posts? Subscribe to our blog now! Submit your email address below and we’ll send you an update every Friday at 1pm ET.

No, thank you. Please do not ask me again.