One doesn’t have to look far to see that there is strong interest in MongoDB compression. MongoDB has had an open ticket since 2009, “Option to Store Data Compressed,” whose Fix Version/s is “planned but not scheduled.” The ticket has many comments, mostly from MongoDB users explaining their use cases for the feature. For example, Khalid Salomão notes that “Compression would be very good to reduce storage cost and improve IO performance,” and Andy notes that “SSD is getting more and more common for servers. They are very fast. The problems are high costs and low capacity.” There are many more examples in the ticket.
In prior blogs we’ve written about the significant performance advantages of using Fractal Tree Indexes with MongoDB. Compression has always been a key feature of Fractal Tree Indexes. We currently support the LZMA, quicklz, and zlib compression algorithms, and our architecture allows us to easily add more. Our large block size provides a further advantage, since these algorithms tend to compress large blocks better than small ones.
Given the interest in compression for MongoDB and our ability to deliver this functionality, we decided to run a benchmark measuring the compression achieved by MongoDB + Fractal Tree Indexes with each available compression type. The benchmark loads 51 million documents into a collection and measures the total size of all files in the database directory (--dbpath).
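The measurement itself amounts to summing file sizes under the data directory. Here is a minimal Python sketch of that step (this is not the benchmark application itself, and the /var/lib/mongodb path is just a placeholder for your actual --dbpath):

```python
import os

def dbpath_size_bytes(dbpath):
    """Sum the sizes of all files under a MongoDB --dbpath directory."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(dbpath):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

if __name__ == "__main__":
    # "/var/lib/mongodb" is a stand-in; point this at your actual --dbpath.
    size = dbpath_size_bytes("/var/lib/mongodb")
    print("dbpath size: %.2f GB" % (size / (1024.0 ** 3)))
```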
The structure of each document is as follows:
```
{ "URI" : "ilabdoor5981234",
  "name" : "weather78123123",
  "origin" : "core-mon2341341",
  "creation" : <nanotime>,
  "expiration" : <nanotime>,
  "data" : "tokutek",
  "randomStrings" : <256 bytes of random 5-character strings, each ending in a space> }
```
If you’d like to run it yourself, the benchmark application is available here.
Benchmark Results
Compared to MongoDB’s file system size (46.43GB), our quicklz implementation is 62.68% smaller (17.33GB), zlib is 69.69% smaller (14.10GB), and lzma is 71.20% smaller (13.37GB).
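As a sanity check, the percentages follow directly from the file sizes: reduction = 1 - (compressed size / uncompressed size). This small snippet recomputes them from the rounded GB figures above (small differences from the reported percentages are rounding artifacts):

```python
baseline_gb = 46.43  # MongoDB's file system size

for name, size_gb in [("quicklz", 17.33), ("zlib", 14.10), ("lzma", 13.37)]:
    reduction = 100.0 * (1.0 - size_gb / baseline_gb)
    print("%-8s %5.2f GB  %.2f%% smaller" % (name, size_gb, reduction))
```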
The obvious benefit of high compression is a smaller on-disk footprint: less disk/flash to buy, faster file system backups, and smaller EC2 instances. Less obvious is IO efficiency. Since all reads and writes operate on compressed data, each IO does more work, often 5x to 20x that of an uncompressed IO operation.
We will continue to share our results with the community and get people’s thoughts on applications where this might help, suggestions for next steps, and any other feedback. Please drop us a line if you are interested in becoming a beta tester.
We’re at Strata this week in the Innovation Pavilion. Please swing by to learn more if you are there.