How to save 200% RAM by selecting the right key data type for #MongoDB

At trunk.ly, we need to store and process more than 100,000 links every day. Each url is represented by the md5 value of the url. Recently, we started to notice an increasing number of page faults in mongostat, and performance started to degrade. After investigating, we realized that we could no longer keep all of the indexes in RAM.

In this post, I am going to show, for 1 million md5 records, how much index memory can be saved by using a slightly different data type for the key: the md5 string index turns out to be roughly 100% to 200% larger than the ObjectId and int indexes.

Here is the Python code to create roughly 1,000,000 records for each of 4 different index types: ObjectId, int, md5 string, and md5 stored as BSON binary (the base64s collection):

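#!/usr/bin/env python

import hashlib

from bson.binary import Binary, MD5_SUBTYPE
from pymongo import MongoClient

# Connect to a mongod running on localhost on the default port.
connection = MongoClient()
db = connection.test_database

# No explicit _id: a 12-byte ObjectId is generated for each document.
print('ObjectID')
for i in range(1, 1000000):
    db.objectids.insert_one({'i': i})

# Integer _id.
print('int')
for i in range(1, 1000000):
    db.ints.insert_one({'_id': i, 'i': i})

# md5 stored as 16 raw bytes in a BSON binary value.
print('Base64 BSON')
for i in range(1, 1000000):
    digest = hashlib.md5(str(i).encode()).digest()
    db.base64s.insert_one({'_id': Binary(digest, MD5_SUBTYPE), 'i': i})

# md5 stored as a 32-character hex string.
print('string')
for i in range(1, 1000000):
    db.strings.insert_one({'_id': hashlib.md5(str(i).encode()).hexdigest(), 'i': i})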
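To get a feel for why the key type matters, here is a minimal sketch that compares the raw payload size of each kind of key, without touching MongoDB. The record number url_id is a hypothetical stand-in for a url, and the figures count only the key bytes themselves, not the per-entry overhead that the index adds on top, so they are an assumption-laden illustration rather than a prediction of index size.

```python
import hashlib

url_id = 42  # hypothetical record number, standing in for a url

hex_key = hashlib.md5(str(url_id).encode()).hexdigest()  # md5 as a hex string
raw_key = hashlib.md5(str(url_id).encode()).digest()     # md5 as raw bytes

# Payload bytes per key, excluding BSON type tags, length prefixes
# and any index bookkeeping.
key_sizes = {
    'md5 hex string': len(hex_key),
    'md5 binary': len(raw_key),
    'ObjectId': 12,  # an ObjectId is always 12 bytes
    'int': 4,        # a BSON int32; larger values need an 8-byte int64
}

for name, size in sorted(key_sizes.items(), key=lambda item: item[1]):
    print('%-15s %2d bytes' % (name, size))
```

The hex string carries twice as many bytes as the binary digest for the same md5 value, which is the kind of difference the index sizes reflect.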

Here are the index sizes (in bytes) that stats() reports for each index type; the rest of the output is omitted:

> db.base64s.stats()
{
        "totalIndexSize" : 67076096,
}
> db.objectids.stats()
{
        "totalIndexSize" : 41598976,
}
> db.ints.stats()
{
        "totalIndexSize" : 32522240,
}
> db.strings.stats()
{
        "totalIndexSize" : 90914816,
}
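The numbers are easier to compare as ratios. The following sketch simply does the arithmetic on the four totalIndexSize values reported above, taking the md5 string index as the baseline; the variable names are mine and nothing here queries MongoDB.

```python
# totalIndexSize values in bytes, copied from the stats() output above.
index_sizes = {
    'strings': 90914816,
    'base64s': 67076096,
    'objectids': 41598976,
    'ints': 32522240,
}

baseline = index_sizes['strings']

# How much larger (in percent) the string index is than each index.
extra = {}
for name, size in index_sizes.items():
    extra[name] = round((baseline / float(size) - 1) * 100)
    print('%-10s %5.1f MB  strings index is %3d%% larger'
          % (name, size / 1e6, extra[name]))
```

With these figures, the string index comes out roughly 120% larger than the ObjectId index and roughly 180% larger than the int index, while the binary md5 index sits in between.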

About

A Programming Artist believes in Minimalism. CTO of http://trunk.ly/. Proud owner of vim, zsh, and wikiReader. A man without a mobile phone.

http://alexdong.com/
http://twitter.com/alexdong/
http://trunk.ly/alexdong/
