Percona How To: Field Names and Document Size in MongoDB

March 24, 2016

Author

David Murphy

MongoDB

In this blog post, we’ll discuss how shorter field names impact performance and document size in MongoDB.

The MongoDB Manual Developer Notes state:

Shortening field names reduce expressiveness and does not provide considerable benefit for larger documents and where document overhead is not of significant concern. Shorter field names do not lessen the size of indexes because indexes have a predefined structure. In general, it is not necessary to use short field names.

This is a pretty one-sided statement, and we should be careful not to fall into this trap. At first glance, you might think “Oh that makes sense due to compression!” However, compression is only one part of the story. When we consider the size of a single document, we need to consider several things:

- Size of the data in the application memory

- Size over the network

- Size in the replication log

- Size in memory in the cache

- Amount of data being sent to the compressor

- Size on disk*

- Size in the journal files*

As you can see, this is a pretty expansive list, and this is just for consideration on field naming – we haven’t even gotten to using the right data types for the value yet.

Further, only the last two items in the list (the starred ones) represent parts of the system that have compression (to date). Put another way, compression covers only two of the seven items, which is under 30% of the discussion about field names. MongoDB Inc’s comment sidesteps the remaining 70% or so of the conversation.

To ensure an even debate, I want to break size down into two major areas: Field Optimization and Value Optimization. They both touch on all of the areas listed above. Sorting, which is not on that list, is affected only by value optimization.

Field Optimization

When we talk about field optimization, we mean purely using shorter field names. This might seem obvious, but when your database field names become object properties in your application code, developers want them to be expressive (i.e., longer and more space-intensive).
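
To see why names matter at all, recall that BSON stores every field name as a null-terminated string inside each document, so a name costs its byte length plus one, every time it appears. The following is a rough sketch (plain Node.js rather than the mongo shell; the helper name fieldNameBytes is made up for illustration) that counts only the bytes spent on field names:

```javascript
// Rough sketch: count the bytes BSON spends on field names in one document.
// Per the BSON spec, each field name is a null-terminated UTF-8 string,
// so a name costs its byte length plus one.
function fieldNameBytes(value) {
  if (Array.isArray(value)) {
    // Array index keys ("0", "1", ...) are identical whatever the field
    // names are, so they are left out of this comparison.
    return value.reduce((sum, item) => sum + fieldNameBytes(item), 0);
  }
  if (value !== null && typeof value === "object" && value.constructor === Object) {
    return Object.entries(value).reduce(
      (sum, [name, child]) =>
        sum + Buffer.byteLength(name, "utf8") + 1 + fieldNameBytes(child),
      0
    );
  }
  return 0; // scalars carry no field names of their own
}

const longNames = { longitude: 28.2211, latitude: 128.2828 };
const shortNames = { lon: 28.2211, lat: 128.2828 };

console.log(fieldNameBytes(longNames));  // 19
console.log(fieldNameBytes(shortNames)); // 8
```

The values are the same 16 bytes in both cases; only the names differ, and that difference is paid again for every embedded document that repeats them.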

Consider the following:

locations=[];

for (i=1;i<=1000;i++){

   locations.push({ longitude : 28.2211, latitude : 128.2828 })

}

devices=[];

for (i=1;i<=10;i++){

   devices.push( {

       name:"iphone6",

       last_ping: ISODate(),

       version: 8.1 ,

       security_pass: true,

       last_10_locations: locations.slice(10,20)

   })

}

x={

   _id : ObjectId(),

   first_name: "David",

   last_name:     "Murphy",

   birthdate:     "Aug 16 2080",

   address :     "123 nowhere drive Nonya, TX, USA , 78701",

   phone_number1:     "512-555-5555",

   phone_number2:    "512-555-5556",

   known_locations: locations,

   last_checkin : ISODate(),

   devices : devices

}

>Object.bsonsize(x)

54879

Seems pretty standard, but wow! That’s about 54.9 KB (54,879 bytes) per document! Now let’s consider another format:

locations2=[];

for (i=1;i<=1000;i++){

   locations2.push({ lon : 28.2211, lat : 128.2828 })

}

devices2=[];

for (i=1;i<=10;i++){

   devices2.push( {

       n:"iphone6",

       lp: ISODate(),

       v: 8.1 ,

       sp: true,

       l10: locations.slice(10,20)   // note: slices the long-named locations array, not locations2

   })

}

y={

   _id : ObjectId(),

   fn:     "David",

   ln:     "Murphy",

   bd:     "Aug 16 2080",

   a :     "123 nowhere drive Nonya, TX, USA , 78701",

   pn1:     "512-555-5555",

   pn2:    "512-555-5556",

   kl:     locations2,

   lc :     ISODate(),

   d :     devices2

}

> Object.bsonsize(y)

41392

> Object.bsonsize(y)/Object.bsonsize(x)

0.754241148708978

This minor change shrinks the document by about 25% without changing any actual data. You have probably already spotted names like kl or l10 and are wondering, “What the heck is that?” This is where some clever tricks in the application code can come in.

You can keep the mapping in a collection in MongoDB or in your application code, so that the stored field l10 is exposed in the code as self.last_10_locations. Some people go so far as to use constants, for example defining LAST_10_LOCATIONS and writing “self.l10 = self.get_value(LAST_10_LOCATIONS)”, so the stored field name stays short while the code stays readable.
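
One possible shape for an application-side mapping is sketched below. It is only an illustration: FIELD_MAP, renameKeys, toStorage, and fromStorage are hypothetical names, the map covers just a few of the fields from the example, and a real application would likely do this inside its model or ODM layer.

```javascript
// Hypothetical mapping between expressive names (used in code) and the
// short names that are actually stored. The pairs mirror the example above.
const FIELD_MAP = {
  first_name: "fn",
  last_name: "ln",
  last_checkin: "lc",
  devices: "d",
  last_10_locations: "l10",
};

// Reverse lookup, built once: short name -> long name.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Recursively rename keys using the given map; unknown keys pass through.
function renameKeys(value, map) {
  if (Array.isArray(value)) return value.map((item) => renameKeys(item, map));
  if (value !== null && typeof value === "object" && value.constructor === Object) {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [
        hasOwn(map, key) ? map[key] : key,
        renameKeys(child, map),
      ])
    );
  }
  return value; // scalars, dates, and other non-plain objects are left untouched
}

const toStorage = (doc) => renameKeys(doc, FIELD_MAP);
const fromStorage = (doc) => renameKeys(doc, REVERSE_MAP);

const stored = toStorage({ first_name: "David", devices: [{ last_10_locations: [] }] });
console.log(JSON.stringify(stored)); // {"fn":"David","d":[{"l10":[]}]}
```

Calling fromStorage on a document read back from the database restores the long names, so the rest of the code never has to see fn or l10.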

Value Optimization

Using the same example, let’s now assume we want to improve how the values themselves are stored. We know we will always pull a user by their _id, or pull the people who checked in most recently. To help optimize this further, let us assume “x” is still our main document:

locations=[];

for (i=1;i<=1000;i++){

   locations.push({ longitude : 28.2211, latitude : 128.2828 })

}

devices=[];

for (i=1;i<=10;i++){

   devices.push( {

       name:"iphone6",

       last_ping: ISODate(),

       version: 8.1 ,

       security_pass: true,

       last_10_locations: locations.slice(10,20)

   })

}

x={

   _id : ObjectId(),

   first_name: "David",

   last_name:     "Murphy",

   birthdate:     "Aug 16 2080",

   address :     "123 nowhere drive Nonya, TX, USA , 78701",

   phone_number1:     "512-555-5555",

   phone_number2:    "512-555-5556",

   known_locations: locations,

   last_checkin : ISODate(),

   devices : devices

}

>Object.bsonsize(x)

54879

But now, instead of optimizing field names, we want to optimize the values:

locations=[];

for (i=1;i<=1000;i++){

   locations.push({ longitude : 28.2211, latitude : 128.2828 })

}

devices=[];

for (i=1;i<=10;i++){

   devices.push( {

       name:"iphone6",

       last_ping: ISODate(),

       version: 8.1 ,

       security_pass: true,

       last_10_locations: locations.slice(10,20)

   })

}

z={

   _id : ObjectId(),

   first_name: "David",

   last_name:     "Murphy",

   birthdate:     ISODate("2080-08-16T00:00:00Z"),

   address :     "123 nowhere drive Nonya, TX, USA , 78701",

   phone_number1:    5125555555,

   phone_number2:    5125555556,

   known_locations: locations,

   last_checkin : ISODate(),

   devices : devices

}

>Object.bsonsize(z)

54853

In this example, we changed the phone numbers from strings to numbers and used the “Date Type” for the birthdate (as the devices documents already do for last_ping). The savings were much smaller than earlier, coming in at only 26 bytes, but this could have a significant impact when multiplied out across many fields and documents. If we had started this example with the floats quoted as strings, as many people do, we would see more of a difference. Always watch out for numbers and dates stored as strings: these almost always waste space.
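
The 26 bytes can be checked by hand from the BSON layout: an element is one type byte, the null-terminated field name, and the value; a string value is a 4-byte length, its bytes, and a trailing null, while dates and doubles are a fixed 8 bytes. A small sketch in plain Node.js (the helper names are made up for illustration, and it assumes the shell stores these phone numbers as 8-byte doubles):

```javascript
// Size in bytes of one BSON element: 1 type byte, the field name as a
// null-terminated string, then the value.
const nameBytes = (name) => Buffer.byteLength(name, "utf8") + 1;

// A BSON string value is a 4-byte length, the UTF-8 bytes, and a trailing null.
const stringElementSize = (name, text) =>
  1 + nameBytes(name) + 4 + Buffer.byteLength(text, "utf8") + 1;

// Doubles, 64-bit integers, and UTC datetimes all take 8 bytes.
const eightByteElementSize = (name) => 1 + nameBytes(name) + 8;

const savedOnBirthdate =
  stringElementSize("birthdate", "Aug 16 2080") - eightByteElementSize("birthdate");
const savedPerPhone =
  stringElementSize("phone_number1", "512-555-5555") -
  eightByteElementSize("phone_number1");

console.log(savedOnBirthdate);                     // 8
console.log(savedPerPhone);                        // 9
console.log(savedOnBirthdate + 2 * savedPerPhone); // 26
```

The longer the string form of a number or date, the bigger the gap, since the typed value stays at 8 bytes regardless.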

When you combine both sets of savings you have:

54879 - (41392 - 26) = 13513

That’s right: about 24.6% less data (13513 of 54879 bytes) in memory, on the network, and for the application to parse with its CPU! Easy wins to reduce your resource needs, and to make the COO happier.

6 Comments

Jai Hirsch

10 years ago

Nice post, data modeling and data optimization is always very important and should never be passed over for non-trivial implementations (in any database).

Daniel Schneller

10 years ago

I wonder: is this different depending on the storage engine used? Esp. does WT as the new default engine make any difference over MMAPv1?

Author

David Murphy

10 years ago

Reply to Daniel Schneller

Hi Daniel,

It is interesting, as the answer is both yes and no depending on the point of view. I should mention we can group the SEs into two camps: MMAPv1/In-Memory vs. WT, PerconaFT, and RocksDB (the latter two are available in the Percona Server for Mongo builds). The points I starred in the blog post denote the areas where the engine choice could affect the outcome. However, this is only relevant to the compression algorithm selected, and some engines have a few available. That is your “yes” area. However, the real focus of the article was the BSON implementation: what the drivers each get back from Mongo, what Mongo keeps in memory, and so forth. The SE has little to do with this level, much like in networking, where your choice of layer three switch has little to do with layer seven networking. The lower-level device can improve or compress things for routing reasons, but anything it does must be invisible to the layer seven applications. For this reason, I chose this topic, as this discussion affects all documents the same way regardless of the compression subsystem involved. It is very true that if you’re using compression, the sizes I show will not be nearly as large on disk. However, in memory, on the network, and on the CPU the size would universally be the same, as everything stores the “payload” as BSON and, in the end, it is not compressed at that level. The storage engines are about the best way to save and get your payload for your use case. In a nutshell: no, the SEs will not matter for this discussion, as they are much lower level and this is about in-memory and transfer structures.

Hopefully, that helps.

Dharshan

10 years ago

Thanks for the detailed post. What if you didn’t use compression? I think if you used WT, even without using compression the field name lengths shouldn’t matter.

I agree that the field name length impacts data size in MMAPv1

Author

David Murphy

10 years ago

Reply to Dharshan

Hi,

I am sorry, I just saw this in my box. WiredTiger is still just storing the BSON structure, which has the field name, its length, and then its content for each field. The field name is stored as a regular string, so it would still consume space on each entry; compression just detects this duplication and removes it. So you would still see this effect. However, it does not add padding the way MMAP does to avoid fragmentation, so if WiredTiger has compression set to None you could still see these effects. As I mentioned, more importantly, network and memory would still have the “full” form of the document, using resources you would not want to be used.