What I didn't know about MongoDB

Here are the notes I took while reading through the book http://oreilly.com/catalog/0636920001096

 

Data type

 

Ordered key/value pairs

Key/value pairs in documents are ordered—the earlier document is distinct from 

the following document: 

{"greeting" : "Hello, world!", "foo" : 3} 

{"foo" : 3, "greeting" : "Hello, world!"} 
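A quick way to see why key order matters is to compare serialized forms. The sketch below uses plain JavaScript objects, with JSON.stringify standing in for BSON serialization. That stand-in is an assumption for illustration, not how the server itself compares documents.

```javascript
// Two objects with the same keys and values, inserted in different orders.
const first = { greeting: "Hello, world!", foo: 3 };
const second = { foo: 3, greeting: "Hello, world!" };

// JavaScript keeps string keys in insertion order, so the serialized
// forms differ even though the contents are the same.
const firstJson = JSON.stringify(first);
const secondJson = JSON.stringify(second);
const identical = firstJson === secondJson;

console.log(firstJson);
console.log(secondJson);
console.log("identical:", identical);
```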

 

Using subcollections is recommended

The MongoDB web console organizes the data in its DBTOP section by 

subcollection (see Chapter 8 for more information on administration). 

• Most drivers provide some syntactic sugar for accessing a subcollection of a given 

collection. For example, in the database shell, db.blog will give you the blog

collection, and db.blog.posts will give you the blog.posts collection.

Subcollections are a great way to organize data in MongoDB, and their use is highly 

recommended. 

 

What a built-in function is doing

A good way of figuring out what a function is doing is to type it without the parentheses. 

This will print the JavaScript source code for the function. For example, if we are curious 

about how the update function works or cannot remember the order of parameters, we 

can do the following: 

> db.foo.update 

function (query, obj, upsert, multi) { 

    assert(query, "need a query"); 

    assert(obj, "need an object"); 

    this._validateObject(obj); 

    this._mongo.update(this._fullName, query, obj, 
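The same trick works in any JavaScript environment, because converting a function to a string returns its source. A minimal sketch, with a hypothetical stand-in function rather than the shell's real update:

```javascript
// A stand-in function; its name and body are hypothetical and only
// mimic the parameter order shown in the shell output.
function update(query, obj, upsert, multi) {
  if (!query) throw new Error("need a query");
  if (!obj) throw new Error("need an object");
  return { query, obj, upsert: Boolean(upsert), multi: Boolean(multi) };
}

// Referencing the function without calling it yields the function object;
// converting that object to a string gives back its source text.
const source = String(update);
console.log(source);
```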

 

Avoid overwriting entire documents, and the 8-byte integer representation

JavaScript has one “number” type. Because MongoDB has three number types (4-byte 

integer, 8-byte integer, and 8-byte float), the shell has to hack around JavaScript’s

limitations a bit. By default, any number in the shell is treated as a double by MongoDB.

This means that if you retrieve a 4-byte integer from the database, manipulate its

document, and save it back to the database even without changing the integer, the integer

will be resaved as a floating-point number. Thus, it is generally a good idea not to

overwrite entire documents from the shell.

 

If you save an 8-byte integer and look at it in the shell, the shell will display it as an

embedded document indicating that it might not be exact. For example, if we save a

document with a "myInteger" key whose value is the 64-bit integer 3, and then look at

it in the shell, it will look like this:

> doc = db.nums.findOne()

{

    "_id" : ObjectId("4c0beecfd096a2580fe6fa08"),

    "myInteger" : {

        "floatApprox" : 3

    }

}

If you insert an 8-byte integer that cannot be accurately displayed as a double, the shell 

will add two keys, "top" and "bottom", containing the 32-bit integers representing the 

4 high-order bytes and 4 low-order bytes of the integer, respectively. For instance, if 

we insert 9223372036854775807, the shell will show us the following:

> db.nums.findOne()

{

    "_id" : ObjectId("4c0beecfd096a2580fe6fa09"),

    "myInteger" : {

        "floatApprox" : 9223372036854776000,

        "top" : 2147483647,

        "bottom" : 4294967295

    }

}
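The three values in that output can be reproduced in plain JavaScript. This is a sketch using BigInt, which the shell described here did not have; it only illustrates how the 64-bit value splits into two 32-bit halves and why the double is an approximation.

```javascript
// 2^63 - 1, the largest signed 64-bit integer. BigInt keeps it exact.
const value = 9223372036854775807n;

// High-order and low-order 32 bits. Each half fits exactly in a double.
const top = Number(value >> 32n);
const bottom = Number(value & 0xffffffffn);

// Converting the whole value to a double rounds it to the nearest
// representable number, which is 2^63.
const floatApprox = Number(value);

// The two halves are enough to rebuild the exact integer.
const rebuilt = (BigInt(top) << 32n) | BigInt(bottom);

console.log({ floatApprox, top, bottom });
console.log("round trip exact:", rebuilt === value);
```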

 

ObjectId explained

If you create multiple new ObjectIds in rapid succession, you can see that only the last 

few digits change each time. In addition, a couple of digits in the middle of the 

ObjectId will change (if you space the creations out by a couple of seconds). This is 

because of the manner in which ObjectIds are created. The 12 bytes of an ObjectId are 

generated as follows: 

Bytes 0-3: Timestamp

Bytes 4-6: Machine

Bytes 7-8: PID

Bytes 9-11: Increment

The first four bytes of an ObjectId are a timestamp in seconds since the epoch. This 

provides a couple of useful properties: 

• The timestamp, when combined with the next five bytes (which will be described 

in a moment), provides uniqueness at the granularity of a second. 

• Because the timestamp comes first, it means that ObjectIds will sort in roughly 

insertion order. This is not a strong guarantee but does have some nice properties, 

such as making ObjectIds efficient to index. 
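To make the layout concrete, here is a small sketch that splits the 24-character hex form of an ObjectId into the four fields described above. The parseObjectId helper is hypothetical, and it assumes the byte layout given in these notes.

```javascript
// Split an ObjectId hex string into the fields described in the notes:
// 4 bytes timestamp, 3 bytes machine, 2 bytes PID, 3 bytes increment.
// Each byte is two hex characters.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestampSeconds: parseInt(hex.slice(0, 8), 16),
    machine: hex.slice(8, 14),
    pid: parseInt(hex.slice(14, 18), 16),
    increment: parseInt(hex.slice(18, 24), 16),
  };
}

const parts = parseObjectId("4c0beecfd096a2580fe6fa08");
console.log(parts);
// The timestamp is seconds since the epoch, so multiply by 1000 for a Date.
console.log(new Date(parts.timestampSeconds * 1000).toISOString());
```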

 

MongoDB’s philosophy of pushing tasks to the client driver whenever possible

• Although ObjectIds are designed to be lightweight and easy to generate, there is 

still some overhead involved in their generation. The decision to generate them on 

the client side reflects an overall philosophy of MongoDB: work should be pushed 

out of the server and to the drivers whenever possible. This philosophy reflects the 

fact that, even with scalable databases like MongoDB, it is easier to scale out at the 

application layer than at the database layer. Moving work to the client side reduces 

the burden requiring the database to scale. 

• By generating ObjectIds on the client side, drivers are capable of providing richer 

APIs than would be otherwise possible. For example, a driver might have its 

insert method either return the generated ObjectId or inject it directly into the 

document that was inserted. If the driver allowed the server to generate 

ObjectIds, then a separate query would be required to determine the value of 

"_id" for an inserted document. 

 

 

CRUD

 

Batch Insert 

If you have a situation where you are inserting multiple documents into a collection, 

you can make the insert faster by using batch inserts. Batch inserts allow you to pass 

an array of documents to the database. 

Sending dozens, hundreds, or even thousands of documents at a time can make inserts 

significantly faster. A batch insert is a single TCP request, meaning that you do not 

incur the overhead of doing hundreds of individual requests. It can also cut insert time 

by eliminating a lot of the header processing that gets done for each message. When 

an individual document is sent to the database, it is prefixed by a header that tells the 

database to do an insert operation on a certain collection. By using batch insert, the 

database doesn’t need to reprocess this information for each document. 
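When the documents come from a large in-memory list, one way to use batch inserts is to split the list into fixed-size chunks and send one request per chunk. A minimal sketch: the toBatches helper and the batch size of 500 are assumptions for illustration, not driver API.

```javascript
// Split an array of documents into batches of at most batchSize documents.
function toBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// 1200 documents become three requests instead of 1200.
const docs = Array.from({ length: 1200 }, (_, i) => ({ n: i }));
const batches = toBatches(docs, 500);
console.log(batches.map((b) => b.length));
```

Each batch could then be handed to a batch insert as a single array.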

 

Question: if 1 out of 1,000 insertions fails due to a conflict, will the remaining 999 still succeed? 

 

Fire-and-forget update by default

Updates are atomic: if two updates happen at the same time, whichever one reaches 

the server first will be applied, and then the next one will be applied. Thus, conflicting 

updates can safely be sent in rapid-fire succession without any documents being

corrupted: the last update will “win.” 

 

The three operations that this chapter focused on (inserts, removes, and updates) seem 

instantaneous because none of them waits for a database response. They are not

asynchronous; they can be thought of as “fire-and-forget” functions: the client sends the 

documents to the server and immediately continues. The client never receives an “OK, 

got that” or a “not OK, could you send that again?” response. 

The benefit to this is that the speed at which you can perform these operations is terrific. 

You are often only limited by the speed at which your client can send them and the 

speed of your network. 

 

Be aware of $push becoming a bottleneck

Using "$push" and other array modifiers is encouraged and often necessary, but it is 

good to keep in mind the trade-offs of such updates. If "$push" becomes a bottleneck, 

it may be worth pulling an embedded array out into a separate collection. 

 

The save Shell Helper 

save is a shell function that lets you insert a document if it doesn’t exist and update it 

if it does. It takes one argument: a document. If the document contains an "_id" key, 

save will do an upsert. Otherwise, it will do an insert. This is just a convenience function 

so that programmers can quickly modify documents in the shell: 

> var x = db.foo.findOne() 

> x.num = 42 

42 

> db.foo.save(x) 

Without save, the last line would have been a more cumbersome 

db.foo.update({"_id" : x._id}, x). 
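The decision save makes can be sketched against an in-memory Map. This is only an illustration of the rule described above: the store, the numeric ids, and the return values are all hypothetical, and a real driver would generate an ObjectId instead.

```javascript
// Hypothetical id source standing in for ObjectId generation.
let nextId = 1;

// Insert when the document has no _id, otherwise upsert by _id.
function save(store, doc) {
  if (doc._id === undefined) {
    doc._id = nextId++;
    store.set(doc._id, doc);
    return "insert";
  }
  store.set(doc._id, doc); // replaces the stored document, or adds it if missing
  return "upsert";
}

const store = new Map();
const x = { name: "joe" };
const firstResult = save(store, x); // no _id yet
x.num = 42;
const secondResult = save(store, x); // _id was assigned by the first call
console.log(firstResult, secondResult, store.get(x._id));
```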

 

Multiupdate 

Multiupdates are a great way of performing schema migrations or rolling out new

features to certain users. Suppose, for example, we want to give a gift to every user who 

has a birthday on a certain day. We can use multiupdate to add a "gift" to their account: 

> db.users.update({birthday : "10/13/1978"}, 

... {$set : {gift : "Happy Birthday!"}}, false, true) 

This would add the "gift" key to all user documents with birthdays on October 13, 

1978. 
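The effect of the last argument can be sketched with an in-memory array. The update function below is a hypothetical stand-in, not the shell's implementation: with multi set to false it stops after the first match, and with multi set to true it modifies every match. It returns how many documents it changed.

```javascript
// Apply changes to matching documents; stop after the first match
// unless multi is true. Returns the number of documents modified.
function update(docs, matches, changes, multi) {
  let n = 0;
  for (const doc of docs) {
    if (!matches(doc)) continue;
    Object.assign(doc, changes); // behaves like a simple $set
    n++;
    if (!multi) break;
  }
  return n;
}

const users = [
  { name: "ann", birthday: "10/13/1978" },
  { name: "bob", birthday: "01/02/1980" },
  { name: "cat", birthday: "10/13/1978" },
];
const isBirthday = (u) => u.birthday === "10/13/1978";
const gift = { gift: "Happy Birthday!" };

// Run the single-document case on a copy so both cases start from the same data.
const single = update(users.map((u) => ({ ...u })), isBirthday, gift, false);
const all = update(users, isBirthday, gift, true);
console.log({ single, all });
```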

 

To see the number of documents updated by a multiple update, you can run the 

getLastError database command (which might be better named "getLastOpStatus"). 

The "n" key will contain the number of documents affected by the update: 

> db.count.update({x : 1}, {$inc : {x : 1}}, false, true) 

> db.runCommand({getLastError : 1}) 

{

    "err" : null, 

    "updatedExisting" : true, 

    "n" : 5, 

    "ok" : true 

}

 

 

Query snapshot and connection pools

 

For each connection to a MongoDB server, the database creates a queue for that

connection’s requests. When the client sends a request, it will be placed at the end of its 

connection’s queue. Any subsequent requests on the connection will occur after the 

enqueued operation is processed. Thus, a single connection has a consistent view of 

the database and can always read its own writes. 

Note that this is a per-connection queue: if we open two shells, we will have two

connections to the database. If we perform an insert in one shell, a subsequent query in 

the other shell might not return the inserted document. However, within a single shell, 

if we query for the document after inserting, the document will be returned. This

behavior can be difficult to duplicate by hand, but on a busy server, interleaved inserts/ 

queries are very likely to occur. Often developers run into this when they insert data in 

one thread and then check that it was successfully inserted in another. For a second or 

two, it looks like the data was not inserted, and then it suddenly appears. 

This behavior is especially worth keeping in mind when using the Ruby, Python, and 

Java drivers, because all three drivers use connection pooling. For efficiency, these 

drivers open multiple connections (a pool) to the server and distribute requests across 

them. 

 

 

Querying

 

Limit the fields returned

Sometimes, you do not need all of the key/value pairs in a document returned. If this 

is the case, you can pass a second argument to find (or findOne) specifying the keys you 

want. This reduces both the amount of data sent over the wire and the time and memory 

used to decode documents on the client side. 

For example, if you have a user collection and you are interested only in the

"username" and "email" keys, you could return just those keys with the following query: 

> db.users.find({}, {"username" : 1, "email" : 1}) 

{

    "_id" : ObjectId("4ba0f0dfd22aa494fd523620"), 

    "username" : "joe", 

    "email" : "joe@example.com"

}

As you can see from the previous output, the "_id" key is always returned, even if it 

isn’t specifically listed. 
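The inclusion rule can be sketched in a few lines of plain JavaScript. The project helper is hypothetical and handles only the simple case shown above, where each requested key is marked with 1.

```javascript
// Keep only the requested keys, plus "_id", which is always returned here.
function project(doc, fields) {
  const out = { _id: doc._id };
  for (const key of Object.keys(fields)) {
    if (fields[key] === 1 && key in doc) {
      out[key] = doc[key];
    }
  }
  return out;
}

const user = {
  _id: "4ba0f0dfd22aa494fd523620",
  username: "joe",
  email: "joe@example.com",
  age: 30, // hypothetical extra field that the projection drops
};
const projected = project(user, { username: 1, email: 1 });
console.log(projected);
```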

 

Cursor chains and lazy loading

When you call find, the shell does not query the database immediately. It waits until 

you actually start requesting results to send the query, which allows you to chain

additional options onto a query before it is performed. Almost every method on a cursor 

object returns the cursor itself so that you can chain them in any order. For instance, 

all of the following are equivalent: 

> var cursor = db.foo.find().sort({"x" : 1}).limit(1).skip(10); 

> var cursor = db.foo.find().limit(1).sort({"x" : 1}).skip(10); 

> var cursor = db.foo.find().skip(10).limit(1).sort({"x" : 1}); 

At this point, the query has not been executed yet. All of these functions merely build 

the query. Now, suppose we call the following: 

> cursor.hasNext() 
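The deferred execution can be sketched with a small class. LazyCursor is a hypothetical in-memory stand-in for the shell's cursor: the chained methods only record options, and nothing runs until hasNext is first called. Because options are recorded rather than applied, the chaining order does not matter.

```javascript
class LazyCursor {
  constructor(docs) {
    this.docs = docs;
    this.options = {};
    this.results = null; // stays null until the query actually runs
    this.pos = 0;
    this.executions = 0;
  }
  // Each option method records its argument and returns the cursor for chaining.
  sort(spec) { this.options.sort = spec; return this; }
  skip(n) { this.options.skip = n; return this; }
  limit(n) { this.options.limit = n; return this; }

  // Always applies sort first, then skip, then limit.
  execute() {
    const { sort, skip = 0, limit } = this.options;
    let out = [...this.docs];
    if (sort) {
      const [key, direction] = Object.entries(sort)[0];
      out.sort((a, b) => (a[key] - b[key]) * direction);
    }
    out = out.slice(skip, limit === undefined ? undefined : skip + limit);
    this.results = out;
    this.executions++;
  }
  hasNext() {
    if (this.results === null) this.execute();
    return this.pos < this.results.length;
  }
  next() {
    if (!this.hasNext()) throw new Error("no more documents");
    return this.results[this.pos++];
  }
}

function drain(cursor) {
  const out = [];
  while (cursor.hasNext()) out.push(cursor.next().x);
  return out;
}

const data = [{ x: 5 }, { x: 3 }, { x: 1 }, { x: 4 }, { x: 2 }];
const a = new LazyCursor(data).sort({ x: 1 }).limit(2).skip(1);
const b = new LazyCursor(data).skip(1).limit(2).sort({ x: 1 });
const executionsBefore = a.executions; // nothing has run yet
const fromA = drain(a);
const fromB = drain(b);
console.log(executionsBefore, fromA, fromB);
```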

 

Index

 

Avoiding Large Skips 

Using skip for a small number of documents is fine. For a large number of results, 

skip can be slow (this is true in nearly every database, not just MongoDB) and should 

be avoided. Usually you can build criteria into the documents themselves to avoid 

having to do large skips, or you can calculate the next query based on the result from 

the previous one. 

Paginating results without skip 

The easiest way to do pagination is to return the first page of results using limit and 

then return each subsequent page as an offset from the beginning. 

> // do not use: slow for large skips 

> var page1 = db.foo.find(criteria).limit(100) 

> var page2 = db.foo.find(criteria).skip(100).limit(100) 

> var page3 = db.foo.find(criteria).skip(200).limit(100) 

... 

However, depending on your query, you can usually find a way to paginate without 

skips. For example, suppose we want to display documents in descending order based 

on "date". We can get the first page of results with the following: 

> var page1 = db.foo.find().sort({"date" : -1}).limit(100) 

Then, we can use the "date" value of the last document as the criteria for fetching the 

next page: 
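A minimal sketch of the idea, using an in-memory array in place of a collection. The getPage helper is hypothetical, dates are plain numbers, and it assumes "date" values are unique. With duplicate dates, a document sharing the boundary value could be skipped.

```javascript
// Return one page of documents in descending "date" order.
// beforeDate is the "date" of the last document on the previous page;
// leave it undefined for the first page.
function getPage(docs, pageSize, beforeDate) {
  return docs
    .filter((d) => beforeDate === undefined || d.date < beforeDate)
    .sort((a, b) => b.date - a.date)
    .slice(0, pageSize);
}

const docs = Array.from({ length: 10 }, (_, i) => ({ date: i + 1 }));

const page1 = getPage(docs, 3);
const last = page1[page1.length - 1];

// The next page is everything older than the last document already shown.
const page2 = getPage(docs, 3, last.date);
console.log(page1.map((d) => d.date), page2.map((d) => d.date));
```

In a real collection the filter would become a range condition on "date", which an index on that key can satisfy without walking past the skipped documents.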

 

Key metrics for query performance

"nscanned" : 64 

This is the number of documents that the database looked through. You want to 

make sure this is as close to the number returned as possible. 

"n" : 64 

This is the number of documents returned. We’re doing pretty well here, because 

the number of documents scanned exactly matches the number returned. Of 

course, given that we’re returning the entire collection, it would be difficult to do 

otherwise. 

"millis" : 0 

The number of milliseconds it took the database to execute the query. 0 is a good 

time to shoot for. 

 

MongoDB query optimizer and parallel query plan execution model

MongoDB has a query optimizer and is very clever about 

choosing which index to use. When you first do a query, the query optimizer tries out 

a number of query plans concurrently. The first one to finish will be used, and the rest 

of the query executions are terminated. That query plan will be remembered for future 

queries on the same keys. The query optimizer periodically retries other plans, in case 

you’ve added new data and the previously chosen plan is no longer best. The only part 

you should need to worry about is giving the query optimizer useful indexes to choose 

from. 

 

 

Aggregation

 

Counting the total number of documents in a collection is fast regardless of collection 

size. 

 

 

Advanced Topics

 

Capped collection use cases and benefits

First, inserts into a capped collection are extremely 

fast. When doing an insert, there is never a need to allocate additional space, and the 

server never needs to search through a free list to find the right place to put a document. 

The inserted document can always be placed directly at the “tail” of the collection, 

overwriting old documents if needed. By default, there are also no indexes to update 

on an insert, so an insert is essentially a single memcpy. 

Another interesting property of capped collections is that queries retrieving documents 

in insertion order are very fast. Because documents are always stored in insertion order, 

queries for documents in that order just walk over the collection, returning documents 

in the exact order that they appear on disk. By default, any find performed on a capped 

collection will always return results in insertion order. 
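The behavior is close to a ring buffer, which the sketch below imitates. CappedCollection here is a hypothetical in-memory class capped by document count for simplicity. That limit is an assumption of the sketch, since the notes do not say how the cap is measured.

```javascript
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.slots = new Array(maxDocs);
    this.count = 0; // total number of inserts so far
  }
  // Always write at the tail, overwriting the oldest slot once full.
  insert(doc) {
    this.slots[this.count % this.maxDocs] = doc;
    this.count++;
  }
  // Walk the slots from oldest to newest, which is insertion order.
  find() {
    const size = Math.min(this.count, this.maxDocs);
    const out = [];
    for (let i = this.count - size; i < this.count; i++) {
      out.push(this.slots[i % this.maxDocs]);
    }
    return out;
  }
}

const log = new CappedCollection(3);
for (let i = 1; i <= 5; i++) log.insert({ n: i });
const remaining = log.find().map((d) => d.n);
console.log(remaining);
```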

 

When to use DBRefs

In short, the best times to use DBRefs are when you’re storing heterogeneous references 

to documents in different collections, like in the previous example or when you want 

to take advantage of some additional DBRef-specific functionality in a driver or tool. 

Otherwise, it’s generally best to just store an "_id" and use that as a reference, because 

that representation tends to be more compact and easier to work with. 
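The two styles can be compared side by side. In this sketch the collections are plain arrays and the helper names are hypothetical. The "$ref" and "$id" keys follow the DBRef convention, while the plain reference stores just the "_id" and leaves the collection implied by the field.

```javascript
// Hypothetical in-memory "database" with two collections.
const db = {
  users: [{ _id: 1, name: "joe" }],
  notes: [{ _id: 5, text: "hello" }],
};

// Plain reference: only the _id is stored; the code knows it points at users.
const post = { _id: 10, authorId: 1 };

// DBRef-style reference: names the target collection explicitly, so one
// field can point at documents in different collections.
const bookmark = { _id: 11, target: { $ref: "notes", $id: 5 } };

function findById(collection, id) {
  return collection.find((d) => d._id === id) || null;
}

function resolveDBRef(database, ref) {
  return findById(database[ref.$ref] || [], ref.$id);
}

const author = findById(db.users, post.authorId);
const target = resolveDBRef(db, bookmark.target);
console.log(author, target);
```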

 

 

Administration

 

Backing up from a slave is recommended

Backing up from a slave is the recommended way to handle data backups with MongoDB. 

 

What happens when “repairing a database”

The underlying process of repairing a database is actually pretty easy to understand: all of the documents in the database are exported and 

then immediately imported, ignoring any that are invalid. After that is complete, all 

indexes are rebuilt. Understanding this mechanism explains some of the properties of 

repair. It can take a long time for large data sets, because all of the data is validated and 

all indexes are rebuilt. Repairing can also leave a database with fewer documents than 

it had before the corruption originally occurred, because any corrupt documents are 

simply ignored. 

Repairing a database will also perform a compaction. Any extra free 

space (which might exist after dropping large collections or removing a

large number of documents, for example) will be reclaimed after a 

repair. 

 

 

 

Replication

 

Replica Sets vs Master Slave

A replica set is basically a master-slave cluster with automatic failover. The biggest 

difference between a master-slave cluster and a replica set is that a replica set does not 

have a single master: one is elected by the cluster and may change to another node if 

the current master goes down. However, they look very similar: a replica set always has 

a single master node (called a primary) and one or more slaves (called secondaries).

 

The nice thing about replica sets is how automatic everything is. First, the set itself does 

a lot of the administration for you, promoting slaves automatically and making sure 

you won’t run into inconsistencies. 

 

Read Scaling 

Scaling out reads with slaves is easy: just set up master-slave replication like usual, and 

make connections directly to the slave servers to handle queries. The only trick is that 

there is a special query option to tell a slave server that it is allowed to handle a query. 

(By default, queries will not be executed on a slave.) This option is called slaveOkay, 

and all MongoDB drivers provide a mechanism for setting it. Some drivers also provide 

facilities to automate the process of distributing queries to slaves; this varies on a

per-driver basis, however.

 

Using Slaves for Data Processing 

Another interesting technique is to use slaves as a mechanism for offloading intensive 

processing or aggregation to avoid degrading performance on the master. To do this, 

start a normal slave, but with the addition of the --master command-line argument. 

Starting with both --slave and --master may seem like a bit of a paradox. What it 

means, however, is that you’ll be able to write to the slave, query on it like usual, and 

basically treat it like you would a normal MongoDB master node. In addition, the slave 

will continue to replicate data from the actual master. This way, you can perform 

blocking operations on the slave without ever affecting the performance of the master 

node. 

When using this technique, you should be sure never to write to any 

database on the slave that is being replicated from the master. The slave 

will not revert any such writes in order to properly mirror the master. 

The slave should also not have any of the databases that are being

replicated when it first starts up. If it does, those databases will not ever 

be fully synced but will just update with new operations. 

 

About

A Programming Artist believes in Minimalism. CTO of http://trunk.ly/. Proud owner of vim, zsh, and wikiReader. A man without a mobile phone.

http://alexdong.com/
http://twitter.com/alexdong/
http://trunk.ly/alexdong/
