Buy Percona ServicesBuy Now!

Exciting Pre-Cursors to MongoDB 3.6 Transactions

 | December 13, 2017 |  Posted In: Insight for DBAs, Insight for Developers, MongoDB, Percona Server for MongoDB


MongoDB 3.6 TransactionsIn this blog post, we are going to focus on the pre-cursors to MongoDB 3.6 transactions.

As already discussed in this series, MongoDB 3.6 has a good number of features in it. Many of them center around sessions (which David Murphy already talked about). Some are highly anticipated by customers, whereas others are very case-specific.

We’ll look at the following areas involving MongoDB 3.6 transactions:

  • Oplog Changes
  • Cluster Time
  • Retryable Writes
  • Causal Consistency
  • Two-Phase Drops
  • killSessions

Oplog Changes: Dynamic Oplog

The oplog collection is responsible for holding all the changes received by the primary so that the secondary can replicate them. It is a circular collection that can hold only a certain amount of data. The difference between the first and last command in that collection is called oplog window. The value of this window is the amount of time we can have a secondary offline without performing an initial sync.

This new feature – which is only available when using wiredTiger – doesn’t require any downtime. Now it is possible to resize the oplog online, compared to the rather complicated previous process of taking nodes out of the rotation and manually running commands on them. In MongoDB 3.6, it’s as simple as running the following on a primary:  db.runCommand({replSetResizeOplog:1, size: 16384}). Then you’re done! As this is a command, it also replicates to your secondaries automatically in due time, making your operation teams lives much better.

Logical Lamport Clocks, aka “Cluster Time”

The Cluster Time feature solves a known issue that when we have instances’ clocks out of sync. This historically was a big pain, even when people were using NTPd. This new version gossips the cluster time to each other and keeps this internal clock from trusting the machine’s clock. It means it is virtually impossible to go back in time, as all the operations occur sequentially unless you aren’t writing with w:majority (which is now the default).

This begs the question, “Why do we need this now?” The MongoDB community has long desired to have transactions like RDBMs do. This is not to say they are always needed, but the option is nice to have when you do need it. Sessions are the part of this of course, but so is making sure the whole cluster agrees on time so that you can correlate transactions changes across the entire cluster.

We must always remember one of the best points to MongoDB is its ability to scale out via sharding, and all features and expectations should still work in sharding. So Cluster Time helps move users farther in this direction.

Retryable Writes

Retryable Writes make your application more resilient. It helps programmers make sure their writes were committed in the database. Before the advent of these retryable writes, most of the applications use “try and except” to make sure the writes were safe, which doesn’t work for everything! Let’s suppose a server receives a write (update), but the client doesn’t receive the ack that the write was performed (this could be due to a network issue, for example). If using “try and except”, this application re-sends the write. If this update is not idempotent, we can have an unexpected value after this command runs twice. If your scratching your head about how this works, here is a real-world example.

We have a collection of store.products, and we want to remove something from the inventory because someone has placed an order. You might want to run something like db.products.update({_id: ObjectID("507f191e810c19729de860ea"),{qty:{$inc: -1}}). Now we get to the real issue. The driver sends the request to your replica-set, but oh no! – an election occurred right after we sent the request. How can we tell if the new primary got this change or not? This is what is called a non-deterministic bug: you’re not clear on what your next steps should be. You don’t know what the inventory was so your not sure if one item was removed from it or not.

We used an example here of an election, but it could be a dropped packet, network split, timeout or any number of other issues in theory. This is exactly the issue we had in MongoDB before 3.6, however now this isn’t an issue. As described in sessions, we do know what the sessionID was, so we can simply ask the replica-set again what happened in that ID once we have a primary again. It will let us know if the change was fully applied or rolled back. Now we know exactly what to do giving use a deterministic behavior. Hurray!

Causal Consistency

Causal Consistency takes advantage of “Cluster Time” in order to offer consistent reads from secondaries. Before the MongoDB 3.6, any application that depended on its own writes needed to read from the primary or use the writeConcern to make sure all the secondaries received the write. This feature traded-off speed for consistency, as the subsequent read waits until the secondary is in the same state (virtual clock) as when the write was performed.

Consider this for a second. Causal consistency can be defined as a model that captures the relationships between operations in the system, and guarantees that each process can observe them in common causal order. In other words, all processes in the system agree on the order of the causally related operations. For example, if there is an operation “A” that causes another operation “B”, then causal consistency assures that each of the other processes in the system observes A before observing B. Therefore, causally related operations and concurrent operations are distinguishable in this model.

In more detail, if one write operation influences another write operation, then these two operations are causally related. One relies on the other. Concurrent operations are the ones that are unrelated or causally independent and are not bound by order in the same way. As such, the following condition holds:

  • Write operations that are related are seen by each process of the system in common order
  • Write operations that are NOT related can occur in any order, and can be seen in different orders by the processes in the system

Fundamentally this means causal is always weaker than full sequential consistence. However, it ignores the pain points related to waiting for all operations, and only has to wait for related ones.

Two-Phase Drops

In the past, if you have ever been in a sharded setup you likely ran into a common issue, where you want to drop a sharded collection with some amount of data. You run the command, then try to run sh.shardCollection(…) to recreate to load data into it. Maybe you had a bug where you corrupted the records, and you wanted to fix it and balance all that at the same time. Unfortunately, you get a “collection already sharded” error back, or maybe it already exists in multiple places. How could this be?

In the old system, you would have the drop command run immediately in kind of a UDP fashion or fire and forget. As long as one request returns, it worked. While this is an extreme example, we have seen them before.

Two-phase drops changes how a collection is removed in a replica set/shard, and depends heavily on a new underlying UUID hidden name for each namespace. By dropping with these, it can be more deterministic. Additionally, it means we can move store.products with UUID XXXXXXXXX out of the way and make a new one with UUID YYYYYYYYYY. Then we can lazily cleanup the old location without blocking you loading data, or without locking up your application.

Furthermore, it allows for sharding the new collection while we remove that 1TB from the old collection UUID location. It also can retry deleting with certainty (we are talking about the old version, not the new one). By using the new Cluster Time feature, the collection can be marked as deleted (and be hidden) and will, in fact, be erased only after all the instances acknowledge that command – solving all the issues in one go.

New “killSession” Command

If you had to kill an operation in MongoDB, you know you need to go into the db.currentOp().inprog, find it and then run db.killOp(op_we_found). Easy as pie. You might also know that if you use sharding, and we ran a bad query doing a massive table scan, it’s not nearly as easy. We literally have to crawl every single server hunt through all the running processes to find it and kill it, then we have to go to the mongos, and kill it also. This assumes we could even find it. Matching separate threads in replica sets for a scatter-gather is super hard, and not very reliable as a solution.

Now enter the killSession command, which makes it possible to kill a session using the ID given to the clients or found in the mongos layer. It can then automatically trace back to the sessions on each shard, as they all use the same ID for the cluster-wide operations. killSession comes to help administrators to list, manage and kill cluster-wide queries. Not only does this eliminate hours of troubleshooting over the course of a year, but it also keeps your team’s sanity intact.

I hope you find this article on discovering what is new in MongoDB 3.6 transactions interesting. If you have any question or concern, please ping me @AdamoTonete or @percona on twitter.

Adamo Tonete

Adamo joined Percona in 2015, after working as a MongoDB/MySQL Database Administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 7/24 environment. Before that, he worked as a Microsoft SQL Server DBA in a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA and in the past three years he has moved to NoSQL technologies without giving up relational databases.


  • > This new version gossips the cluster time to each other and keeps this internal clock from trusting the machine’s clock. It means it is virtually impossible to go back in time, as all the operations occur sequentially unless you aren’t writing with w:majority (which is now the default).

    Hi Adamo, I checked the 3.6 release notes and I do not see anything about the default write concern changing. Did you mean to say read concern?

    • Hi Tom, I could understand your confusion however with protocolVersion:1 majority is assumed. As such it could be a majority setting in 3.4, or be more forced in 3.6 with the forced move to this new protocol. You can read more about that @

Leave a Reply