How to save 200% RAM by selecting the right key data type for #MongoDB
At trunk.ly, we need to store and process more than 100,000 links every day. Each URL is represented by the md5 hash of the URL. Recently, we started to notice an increasing number of page faults in mongostat, and performance started to degrade. After investigating, we realized that we could no longer keep all of our indexes in RAM.
In this post, I am going to show, using 1 million md5 records, how an index on md5 strings can need roughly 100% to 200% more memory than an index on a slightly different data type.
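
To make the difference between key types concrete, here is a minimal sketch of the two ways the same md5 value can be represented. It assumes Python 3 and uses a made-up URL; it only measures the raw key length, not the per-entry overhead MongoDB adds in the index.

```python
import hashlib

# Hypothetical URL, used only for illustration.
url = "http://example.com/some/page"

md5 = hashlib.md5(url.encode("utf-8"))
hex_key = md5.hexdigest()  # hexadecimal text form of the hash
raw_key = md5.digest()     # the same hash as raw bytes

print(len(hex_key), "characters as a hex string")
print(len(raw_key), "bytes as a raw digest")

# Both forms carry the same information and convert losslessly.
assert bytes.fromhex(hex_key) == raw_key
assert raw_key.hex() == hex_key
```

The hex string is twice as long as the raw digest, which is why storing the digest as binary instead of text is worth testing.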

Here is the Python code to create roughly 1,000,000 records for each of 4 different index key types: ObjectId, int, md5 string and binary md5 (a BSON binary value, which the mongo shell displays as base64):
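#!/usr/bin/env python
import hashlib

from bson.binary import Binary, MD5_SUBTYPE
from pymongo import MongoClient

# Connect to a MongoDB server on localhost with the default port.
client = MongoClient()
db = client.test_database

# Default _id: the driver generates an ObjectId for each document.
print('ObjectID')
for i in range(1, 1000000):
    db.objectids.insert_one({'i': i})

# Integer _id.
print('int')
for i in range(1, 1000000):
    db.ints.insert_one({'_id': i, 'i': i})

# Raw 16-byte md5 digest stored as BSON binary with the MD5 subtype.
print('Base64 BSON')
for i in range(1, 1000000):
    digest = hashlib.md5(str(i).encode()).digest()
    db.base64s.insert_one({'_id': Binary(digest, MD5_SUBTYPE), 'i': i})

# 32-character hexadecimal md5 string.
print('string')
for i in range(1, 1000000):
    db.strings.insert_one({'_id': hashlib.md5(str(i).encode()).hexdigest(), 'i': i})

Here are the index sizes MongoDB reports for each key type: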
> db.base64s.stats()
{
"totalIndexSize" : 67076096,
}
> db.objectids.stats()
{
"totalIndexSize" : 41598976,
}
> db.ints.stats()
{
"totalIndexSize" : 32522240,
}
> db.strings.stats()
{
"totalIndexSize" : 90914816,
}
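
To compare these figures directly, the short sketch below takes the four totalIndexSize values reported above and works out how much larger the md5 string index is than each alternative. The dictionary and the helper name extra_percent are only for illustration; the numbers are copied from the stats output.

```python
# totalIndexSize values in bytes, copied from the stats() output above.
index_sizes = {
    "ints": 32522240,
    "objectids": 41598976,
    "base64s": 67076096,
    "strings": 90914816,
}


def extra_percent(name, baseline="strings"):
    """Return how much larger the baseline index is than `name`, in percent."""
    size = index_sizes[name]
    return (index_sizes[baseline] - size) / size * 100


# Print the indexes from smallest to largest.
for name, size in sorted(index_sizes.items(), key=lambda kv: kv[1]):
    print("%-10s %6.1f MiB  strings index is %5.1f%% larger"
          % (name, size / 2**20, extra_percent(name)))
```

With these numbers, the string index comes out about 180% larger than the int index, about 119% larger than the ObjectId index, and about 36% larger than the binary md5 index.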