Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

What’s new in TokuMX 1.4, Part 2: Partitioned oplog

February 18, 2014

Author

Leif.Walsh

MySQL

Share this Post:

We just released version 1.4.0 of TokuMX, our high-performance distribution of MongoDB. There are a lot of improvements in this version (release notes), the most of any release yet. In this series of blog posts, we describe the most interesting changes and how they’ll affect users.

In MongoDB, the replication oplog is a capped collection, with a fixed size on disk, and therefore the amount of history (measured in days) varies as the application makes changes faster or slower.

In TokuMX, capped collections don’t support as high concurrency levels as we want to provide for the oplog, so instead of picking a fixed disk size to cap it, we instead let users specify the number of days’ worth of operations to keep around (the expireOplogDays parameter) and we have background threads that delete old data to maintain this.

This is nice because it simplifies how you understand your cluster’s health. Unfortunately, it was achieved by running threads that are constantly trimming the tail of the oplog, which hold metadata locks and consume server resources. Sometimes they can cause bad stalls if a write lock (say, to add a member to the replica set) gets in line behind a long trimming operation. Some of this risk was taken care of in TokuMX 1.3.3, but in TokuMX 1.4.0 we are getting rid of these threads completely.

In TokuMX 1.4.0 the oplog is now a “partitioned” collection, which is also a new type of collection, similar to SQL partitioned tables. A partitioned collection is represented on disk by a set of normal collections, together with some metadata that specifies which data is in which collection.

There are currently many limitations for partitioned collections, so they probably aren’t useful for most applications yet, but they can be used by normal applications, and they’ll become more useful in the future. The current limitations include:

- Secondary indexes aren’t supported, they must use the default “_id” index as the only index, and they must be partitioned according to the _id field (and they cannot use primary keys).

- Partitioned collections can’t be renamed.

- The addPartition/dropPartition commands cannot be replicated, so they should not be used in a replica set outside of the local database.

For the oplog, each day a new partition is added and the oldest partition is dropped, which quickly and easily reclaims the space used by the oldest partition, without consuming extra resources or blocking other operations the way trimming could in the past. This will make cluster administration much simpler with respect to the oplog, for many users.

To get information about the partitioning of the oplog there are commands to report this metadata:

<code>rs0:PRIMARY> rs.oplogPartitionInfo()
{
       "numPartitions" : NumberLong(2),
       "partitions" : [
               {
                       "_id" : NumberLong(0),
                       "max" : {
                               "" : BinData(0,"AAAAAAAAAAEAAAAAAAAAAA==")
                       },
                       "createTime" : ISODate("2014-02-03T21:04:00.983Z")
               },
               {
                       "_id" : NumberLong(1),
                       "max" : {
                               "" : { "$maxKey" : 1 }
                       },
                       "createTime" : ISODate("2014-02-03T21:04:47.494Z")
               }
       ],
       "ok" : 1
}
rs0:PRIMARY> rs.oplogRefsPartitionInfo()
{
       "numPartitions" : NumberLong(2),
       "partitions" : [
               {
                       "_id" : NumberLong(0),
                       "max" : {
                               "" : {
                                       "oid" : ObjectId("52f004416ae5e77a3cbbbc4c"),
                                       "seq" : NumberLong(3)
                               }
                       },
                       "createTime" : ISODate("2014-02-03T21:04:01.109Z"),
                       "maxRefGTID" : BinData(0,"AAAAAAAAAAEAAAAAAAAAAA==")
               },
               {
                       "_id" : NumberLong(1),
                       "max" : {
                               "" : { "$maxKey" : 1 }
                       },
                       "createTime" : ISODate("2014-02-03T21:04:47.543Z")
               }
       ],
       "ok" : 1
}</code>

<code>rs0:PRIMARY> rs.oplogPartitionInfo()

{

"numPartitions" : NumberLong(2),

"partitions" : [

{

"_id" : NumberLong(0),

"max" : {

"" : BinData(0,"AAAAAAAAAAEAAAAAAAAAAA==")

"createTime" : ISODate("2014-02-03T21:04:00.983Z")

{

"_id" : NumberLong(1),

"max" : {

"" : { "$maxKey" : 1 }

"createTime" : ISODate("2014-02-03T21:04:47.494Z")

}

"ok" : 1

}

rs0:PRIMARY> rs.oplogRefsPartitionInfo()

{

"numPartitions" : NumberLong(2),

"partitions" : [

{

"_id" : NumberLong(0),

"max" : {

"" : {

"oid" : ObjectId("52f004416ae5e77a3cbbbc4c"),

"seq" : NumberLong(3)

}

"createTime" : ISODate("2014-02-03T21:04:01.109Z"),

"maxRefGTID" : BinData(0,"AAAAAAAAAAEAAAAAAAAAAA==")

{

"_id" : NumberLong(1),

"max" : {

"" : { "$maxKey" : 1 }

"createTime" : ISODate("2014-02-03T21:04:47.543Z")

}

"ok" : 1

}</code>

To manually trim partitions, you can specify a GTID or a timestamp to trim to. For example, with the above information, either of these commands will trim the first partition away:

<code>rs0:PRIMARY> rs.trimToGTID(BinData(0,"AAAAAAAAAAEAAAAAAAAAAA=="))
rs0:PRIMARY> rs.trimToTS(ISODate("2014-02-03T21:04:47.494Z"))</code>

1 2	<code>rs0:PRIMARY> rs.trimToGTID(BinData(0,"AAAAAAAAAAEAAAAAAAAAAA==")) rs0:PRIMARY> rs.trimToTS(ISODate("2014-02-03T21:04:47.494Z"))</code>

On upgrade, existing oplogs will be converted to partitioned collections with the old data all in one partition and an empty one after it for new data, and after enough time, it will be dropped all at once.

Want to check out the newest version of TokuMX? Download TokuMX 1.4.0 here: