What I didn't know about MongoDB
Here are the notes I took while reading through the book http://oreilly.com/catalog/0636920001096
Data type
Ordered key/value pairs
Key/value pairs in documents are ordered—the earlier document is distinct from
the following document:
{"greeting" : "Hello, world!", "foo" : 3}
{"foo" : 3, "greeting" : "Hello, world!"}
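A quick way to see the difference without a database, as a minimal sketch in plain JavaScript: objects keep non-numeric string keys in insertion order, so the two documents above serialize differently. The variable names are only for illustration.

```javascript
// Two documents with the same pairs in a different order.
const first = { "greeting": "Hello, world!", "foo": 3 };
const second = { "foo": 3, "greeting": "Hello, world!" };

// JavaScript keeps non-numeric string keys in insertion order,
// so the key lists (and the serialized forms) differ.
const firstKeys = Object.keys(first);
const secondKeys = Object.keys(second);
const sameSerialization = JSON.stringify(first) === JSON.stringify(second);

console.log(firstKeys, secondKeys, sameSerialization);
```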
Using subcollections is recommended
• The MongoDB web console organizes the data in its DBTOP section by
subcollection (see Chapter 8 for more information on administration).
• Most drivers provide some syntactic sugar for accessing a subcollection of a given
collection. For example, in the database shell, db.blog will give you the blog
collection, and db.blog.posts will give you the blog.posts collection.
Subcollections are a great way to organize data in MongoDB, and their use is highly
recommended.
What a built-in function is doing
A good way of figuring out what a function is doing is to type it without the parentheses.
This will print the JavaScript source code for the function. For example, if we are curious
about how the update function works or cannot remember the order of parameters, we
can do the following:
> db.foo.update
function (query, obj, upsert, multi) {
assert(query, "need a query");
assert(obj, "need an object");
this._validateObject(obj);
this._mongo.update(this._fullName, query, obj,
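The trick is not specific to the shell: any JavaScript function can be turned into its source text. A minimal sketch, where the update function below is a hypothetical stand-in and not MongoDB's actual code:

```javascript
// Hypothetical stand-in for a shell helper; NOT the real db.foo.update.
function update(query, obj, upsert, multi) {
  return [query, obj, Boolean(upsert), Boolean(multi)];
}

// Referring to the function without calling it gives access to its source,
// which is what the shell prints when the parentheses are left off.
const source = update.toString();
const parameterCount = update.length; // number of declared parameters

console.log(source);
console.log(parameterCount);
```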
Avoid overwriting entire documents, and the 8-byte integer representation
JavaScript has one “number” type. Because MongoDB has three number types (4-byte
integer, 8-byte integer, and 8-byte float), the shell has to hack around JavaScript’s
limitations a bit. By default, any number in the shell is treated as a double by MongoDB.
This means that if you retrieve a 4-byte integer from the database, manipulate its
document, and save it back to the database even without changing the integer, the integer
will be resaved as a floating-point number. Thus, it is generally a good idea not to
overwrite entire documents from the shell.
If you save an 8-byte integer and look at it in the shell, the shell will display it as an
embedded document indicating that it might not be exact. For example, if we save a
document with a "myInteger" key whose value is the 64-bit integer, 3, and then look at
it in the shell, it will look like this:
> doc = db.nums.findOne()
{
"_id" : ObjectId("4c0beecfd096a2580fe6fa08"),
"myInteger" : {
"floatApprox" : 3
}
}
If you insert an 8-byte integer that cannot be accurately displayed as a double, the shell
will add two keys, "top" and "bottom", containing the 32-bit integers representing the
4 high-order bytes and 4 low-order bytes of the integer, respectively. For instance, if
we insert 9223372036854775807, the shell will show us the following:
> db.nums.findOne()
{
"_id" : ObjectId("4c0beecfd096a2580fe6fa09"),
"myInteger" : {
"floatApprox" : 9223372036854776000,
"top" : 2147483647,
"bottom" : 4294967295
}
}
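The three keys can be reproduced in plain JavaScript with BigInt, which holds the 64-bit value exactly. This is only a sketch of the arithmetic, not the shell's own code; the variable names mirror the keys shown above.

```javascript
// Largest signed 64-bit integer, held exactly as a BigInt.
const value = 9223372036854775807n;

// Split into the 4 high-order and 4 low-order bytes.
const top = Number(value >> 32n);
const bottom = Number(value & 0xffffffffn);

// Converting to a double loses precision, hence "floatApprox".
const floatApprox = Number(value);
const exact = BigInt(floatApprox) === value;

// The two halves are enough to rebuild the original value.
const rebuilt = (BigInt(top) << 32n) | BigInt(bottom);

console.log({ floatApprox, top, bottom, exact, rebuilt });
```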
ObjectId explained
If you create multiple new ObjectIds in rapid succession, you can see that only the last
few digits change each time. In addition, a couple of digits in the middle of the
ObjectId will change (if you space the creations out by a couple of seconds). This is
because of the manner in which ObjectIds are created. The 12 bytes of an ObjectId are
generated as follows:
0 1 2 3   | 4 5 6   | 7 8 | 9 10 11
Timestamp | Machine | PID | Increment
The first four bytes of an ObjectId are a timestamp in seconds since the epoch. This
provides a couple of useful properties:
• The timestamp, when combined with the next five bytes (which will be described
in a moment), provides uniqueness at the granularity of a second.
• Because the timestamp comes first, it means that ObjectIds will sort in roughly
insertion order. This is not a strong guarantee but does have some nice properties,
such as making ObjectIds efficient to index.
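To make the layout concrete, here is a small sketch that splits a 24-character hex ObjectId string into the four fields described above. The function name parseObjectId is made up for this note, and it assumes the byte layout given here (4 + 3 + 2 + 3 bytes).

```javascript
// Split a 24-character hex ObjectId string into the four fields
// described above: timestamp, machine, PID, increment.
function parseObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestamp: parseInt(hex.slice(0, 8), 16),   // 4 bytes, seconds since the epoch
    machine: hex.slice(8, 14),                  // 3 bytes, kept as hex
    pid: parseInt(hex.slice(14, 18), 16),       // 2 bytes
    increment: parseInt(hex.slice(18, 24), 16), // 3 bytes
  };
}

// The two ObjectIds from the integer examples earlier in these notes.
const a = parseObjectId("4c0beecfd096a2580fe6fa08");
const b = parseObjectId("4c0beecfd096a2580fe6fa09");

console.log(a, new Date(a.timestamp * 1000).toISOString());
console.log(b.increment - a.increment);
```

The two ids were created in the same second by the same process, so only the increment differs.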
MongoDB’s philosophy on pushing tasks to client driver whenever possible
• Although ObjectIds are designed to be lightweight and easy to generate, there is
still some overhead involved in their generation. The decision to generate them on
the client side reflects an overall philosophy of MongoDB: work should be pushed
out of the server and to the drivers whenever possible. This philosophy reflects the
fact that, even with scalable databases like MongoDB, it is easier to scale out at the
application layer than at the database layer. Moving work to the client side reduces
the burden requiring the database to scale.
• By generating ObjectIds on the client side, drivers are capable of providing richer
APIs than would be otherwise possible. For example, a driver might have its
insert method either return the generated ObjectId or inject it directly into the
document that was inserted. If the driver allowed the server to generate
ObjectIds, then a separate query would be required to determine the value of
"_id" for an inserted document.
CRUD
Batch Insert
If you have a situation where you are inserting multiple documents into a collection,
you can make the insert faster by using batch inserts. Batch inserts allow you to pass
an array of documents to the database.
Sending dozens, hundreds, or even thousands of documents at a time can make inserts
significantly faster. A batch insert is a single TCP request, meaning that you do not
incur the overhead of doing hundreds of individual requests. It can also cut insert time
by eliminating a lot of the header processing that gets done for each message. When
an individual document is sent to the database, it is prefixed by a header that tells the
database to do an insert operation on a certain collection. By using batch insert, the
database doesn’t need to reprocess this information for each document.
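When the documents come from a large array, they first have to be cut into batches. A minimal sketch of that step, where toBatches is a hypothetical helper (not part of any driver) and the batch size of 1000 is an arbitrary choice:

```javascript
// Hypothetical helper: split documents into batches of at most `size`,
// so each batch can be handed to a single batch-insert call.
function toBatches(docs, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// 2500 sample documents become 3 requests instead of 2500.
const docs = Array.from({ length: 2500 }, (_, i) => ({ n: i }));
const batches = toBatches(docs, 1000);
const sizes = batches.map((batch) => batch.length);

console.log(sizes);
```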
Question: if 1 out of 1,000 insertions fails due to a conflict, will the remaining 999 still succeed?
Fire-and-forget update by default
Updates are atomic: if two updates happen at the same time, whichever one reaches
the server first will be applied, and then the next one will be applied. Thus, conflicting
updates can safely be sent in rapid-fire succession without any documents being
corrupted: the last update will “win.”
The three operations that this chapter focused on (inserts, removes, and updates) seem
instantaneous because none of them waits for a database response. They are not
asynchronous; they can be thought of as “fire-and-forget” functions: the client sends the
documents to the server and immediately continues. The client never receives an “OK,
got that” or a “not OK, could you send that again?” response.
The benefit to this is that the speed at which you can perform these operations is terrific.
You are often only limited by the speed at which your client can send them and the
speed of your network.
Be aware of $push becoming a bottleneck
Using "$push" and other array modifiers is encouraged and often necessary, but it is
good to keep in mind the trade-offs of such updates. If "$push" becomes a bottleneck,
it may be worth pulling an embedded array out into a separate collection.
The save Shell Helper
save is a shell function that lets you insert a document if it doesn’t exist and update it
if it does. It takes one argument: a document. If the document contains an "_id" key,
save will do an upsert. Otherwise, it will do an insert. This is just a convenience function
so that programmers can quickly modify documents in the shell:
> var x = db.foo.findOne()
> x.num = 42
42
> db.foo.save(x)
Without save, the last line would have been a more cumbersome
db.foo.update({"_id" : x._id}, x).
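The insert-or-upsert decision can be modeled with a toy in-memory collection. This is only a sketch of the semantics described above: ToyCollection and its methods are invented for this note and are not the shell's implementation.

```javascript
// Toy in-memory collection, only to illustrate the semantics of save.
class ToyCollection {
  constructor() {
    this.docs = new Map();
    this.nextId = 1; // stand-in for a generated "_id"
  }
  insert(doc) {
    if (!("_id" in doc)) doc._id = this.nextId++;
    this.docs.set(doc._id, { ...doc });
  }
  // Replace the document with a matching _id, or add it if missing (upsert).
  upsertById(doc) {
    this.docs.set(doc._id, { ...doc });
  }
  // save: upsert when the document has an "_id", insert otherwise.
  save(doc) {
    if ("_id" in doc) this.upsertById(doc);
    else this.insert(doc);
  }
  findOne(id) {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : null;
  }
}

const foo = new ToyCollection();
foo.save({ name: "a" });   // no "_id": plain insert
const x = foo.findOne(1);
x.num = 42;
foo.save(x);               // has "_id": replaces the stored copy
const stored = foo.findOne(1);

console.log(stored, foo.docs.size);
```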
Multiupdate
Multiupdates are a great way of performing schema migrations or rolling out new
features to certain users. Suppose, for example, we want to give a gift to every user who
has a birthday on a certain day. We can use multiupdate to add a "gift" to their account:
> db.users.update({birthday : "10/13/1978"},
... {$set : {gift : "Happy Birthday!"}}, false, true)
This would add the "gift" key to all user documents with birthdays on October 13,
1978.
To see the number of documents updated by a multiple update, you can run the
getLastError database command (which might be better named "getLastOpStatus").
The "n" key will contain the number of documents affected by the update:
> db.count.update({x : 1}, {$inc : {x : 1}}, false, true)
> db.runCommand({getLastError : 1})
{
"err" : null,
"updatedExisting" : true,
"n" : 5,
"ok" : true
}
Query snapshot and connection pools
For each connection to a MongoDB server, the database creates a queue for that
connection’s requests. When the client sends a request, it will be placed at the end of its
connection’s queue. Any subsequent requests on the connection will occur after the
enqueued operation is processed. Thus, a single connection has a consistent view of
the database and can always read its own writes.
Note that this is a per-connection queue: if we open two shells, we will have two
connections to the database. If we perform an insert in one shell, a subsequent query in
the other shell might not return the inserted document. However, within a single shell,
if we query for the document after inserting, the document will be returned. This
behavior can be difficult to duplicate by hand, but on a busy server, interleaved inserts/
queries are very likely to occur. Often developers run into this when they insert data in
one thread and then check that it was successfully inserted in another. For a second or
two, it looks like the data was not inserted, and then it suddenly appears.
This behavior is especially worth keeping in mind when using the Ruby, Python, and
Java drivers, because all three drivers use connection pooling. For efficiency, these
drivers open multiple connections (a pool) to the server and distribute requests across
them.
Querying
Limit the fields returned
Sometimes, you do not need all of the key/value pairs in a document returned. If this
is the case, you can pass a second argument to find (or findOne) specifying the keys you
want. This reduces both the amount of data sent over the wire and the time and memory
used to decode documents on the client side.
For example, if you have a user collection and you are interested only in the
"username" and "email" keys, you could return just those keys with the following query:
> db.users.find({}, {"username" : 1, "email" : 1})
{
"_id" : ObjectId("4ba0f0dfd22aa494fd523620"),
"username" : "joe",
"email" : "joe@example.com"
}
As you can see from the previous output, the "_id" key is always returned, even if it
isn’t specifically listed.
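The effect of the second argument can be imitated on a plain object. A rough sketch, where project is a made-up function that handles only the inclusion case shown above and the "age" key is invented sample data:

```javascript
// Sketch of inclusion-style projection: keep the listed keys,
// and always keep "_id" even though it is not listed.
function project(doc, fields) {
  const out = {};
  if ("_id" in doc) out._id = doc._id;
  for (const key of Object.keys(fields)) {
    if (fields[key] === 1 && key in doc) out[key] = doc[key];
  }
  return out;
}

// "age" is invented sample data to show a key being dropped.
const user = {
  _id: "4ba0f0dfd22aa494fd523620",
  username: "joe",
  email: "joe@example.com",
  age: 27,
};
const result = project(user, { username: 1, email: 1 });

console.log(result);
```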
Cursor chaining and lazy loading
When you call find, the shell does not query the database immediately. It waits until
you actually start requesting results to send the query, which allows you to chain
additional options onto a query before it is performed. Almost every method on a cursor
object returns the cursor itself so that you can chain them in any order. For instance,
all of the following are equivalent:
> var cursor = db.foo.find().sort({"x" : 1}).limit(1).skip(10);
> var cursor = db.foo.find().limit(1).sort({"x" : 1}).skip(10);
> var cursor = db.foo.find().skip(10).limit(1).sort({"x" : 1});
At this point, the query has not been executed yet. All of these functions merely build
the query. Now, suppose we call the following:
> cursor.hasNext()
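The deferred execution can be modeled with a toy cursor over an in-memory array. This is a sketch and not the shell's cursor class: LazyCursor is invented for this note, it sorts on a single numeric key only, and it always applies sort, then skip, then limit, which is why the chaining order does not matter.

```javascript
// Toy cursor: sort/skip/limit only record options and return the cursor;
// the "query" runs the first time results are requested.
class LazyCursor {
  constructor(docs) {
    this.docs = docs;
    this.options = { sort: null, skip: 0, limit: 0 };
    this.results = null; // null until the query has been executed
  }
  sort(spec) { this.options.sort = spec; return this; }
  skip(n) { this.options.skip = n; return this; }
  limit(n) { this.options.limit = n; return this; }
  execute() {
    let out = [...this.docs];
    const { sort, skip, limit } = this.options;
    if (sort) {
      const [key, direction] = Object.entries(sort)[0];
      out.sort((a, b) => (a[key] - b[key]) * direction);
    }
    out = out.slice(skip);                    // sort first, then skip
    if (limit > 0) out = out.slice(0, limit); // then limit
    this.results = out;
  }
  hasNext() {
    if (this.results === null) this.execute();
    return this.results.length > 0;
  }
  next() { return this.hasNext() ? this.results.shift() : null; }
}

const docs = Array.from({ length: 20 }, (_, i) => ({ x: 19 - i }));
const a = new LazyCursor(docs).sort({ x: 1 }).limit(1).skip(10);
const b = new LazyCursor(docs).skip(10).limit(1).sort({ x: 1 });

const executedBeforeUse = a.results !== null; // nothing has run yet
const firstA = a.next();
const firstB = b.next();

console.log(executedBeforeUse, firstA, firstB);
```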
Index
Avoiding Large Skips
Using skip for a small number of documents is fine. For a large number of results,
skip can be slow (this is true in nearly every database, not just MongoDB) and should
be avoided. Usually you can build criteria into the documents themselves to avoid
having to do large skips, or you can calculate the next query based on the result from
the previous one.
Paginating results without skip
The easiest way to do pagination is to return the first page of results using limit and
then return each subsequent page as an offset from the beginning.
> // do not use: slow for large skips
> var page1 = db.foo.find(criteria).limit(100)
> var page2 = db.foo.find(criteria).skip(100).limit(100)
> var page3 = db.foo.find(criteria).skip(200).limit(100)
...
However, depending on your query, you can usually find a way to paginate without
skips. For example, suppose we want to display documents in descending order based
on "date". We can get the first page of results with the following:
> var page1 = db.foo.find().sort({"date" : -1}).limit(100)
Then, we can use the "date" value of the last document as the criteria for fetching the
next page:
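The idea can be sketched against an in-memory array standing in for db.foo. The fetchPage function is made up for this note, dates are plain numbers for brevity, and the sketch assumes "date" values are unique (ties would need a secondary key). Because the sort is descending, the next page holds documents with a smaller "date" than the last one displayed, which in the shell would correspond to a "$lt" condition on "date".

```javascript
// In-memory stand-in for db.foo; dates are plain numbers for brevity.
const foo = Array.from({ length: 250 }, (_, i) => ({ date: i }));

// Return up to pageSize documents in descending "date" order,
// starting strictly before `before` (or from the newest if null).
function fetchPage(before, pageSize) {
  return foo
    .filter((doc) => before === null || doc.date < before)
    .sort((a, b) => b.date - a.date)
    .slice(0, pageSize);
}

const page1 = fetchPage(null, 100);
const latest = page1[page1.length - 1];     // last document displayed
const page2 = fetchPage(latest.date, 100);  // continue from there, no skip

console.log(page1[0].date, latest.date, page2[0].date);
```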
Key metrics for query performance
"nscanned" : 64
This is the number of documents that the database looked through. You want to
make sure this is as close to the number returned as possible.
"n" : 64
This is the number of documents returned. We’re doing pretty well here, because
the number of documents scanned exactly matches the number returned. Of
course, given that we’re returning the entire collection, it would be difficult to do
otherwise.
"millis" : 0
The number of milliseconds it took the database to execute the query. 0 is a good
time to shoot for.
MongoDB query optimizer and parallel query plan execution model
MongoDB has a query optimizer and is very clever about
choosing which index to use. When you first do a query, the query optimizer tries out
a number of query plans concurrently. The first one to finish will be used, and the rest
of the query executions are terminated. That query plan will be remembered for future
queries on the same keys. The query optimizer periodically retries other plans, in case
you’ve added new data and the previously chosen plan is no longer best. The only part
you should need to worry about is giving the query optimizer useful indexes to choose
from.
Aggregation
Counting the total number of documents in a collection is fast regardless of collection
size.
Advanced Topics
Capped collection use cases and benefits
First, inserts into a capped collection are extremely
fast. When doing an insert, there is never a need to allocate additional space, and the
server never needs to search through a free list to find the right place to put a document.
The inserted document can always be placed directly at the “tail” of the collection,
overwriting old documents if needed. By default, there are also no indexes to update
on an insert, so an insert is essentially a single memcpy.
Another interesting property of capped collections is that queries retrieving documents
in insertion order are very fast. Because documents are always stored in insertion order,
queries for documents in that order just walk over the collection, returning documents
in the exact order that they appear on disk. By default, any find performed on a capped
collection will always return results in insertion order.
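Both properties fall out of a circular buffer. A toy model, with the caveat that ToyCappedCollection is invented for this note and counts its capacity in documents for simplicity, which is an assumption of the sketch rather than how the server stores data:

```javascript
// Toy fixed-size collection backed by a circular buffer.
class ToyCappedCollection {
  constructor(capacity) {
    this.slots = new Array(capacity);
    this.tail = 0;  // index where the next document is written
    this.count = 0; // number of live documents
  }
  // Write at the tail, overwriting the oldest document when full.
  insert(doc) {
    this.slots[this.tail] = doc;
    this.tail = (this.tail + 1) % this.slots.length;
    this.count = Math.min(this.count + 1, this.slots.length);
  }
  // Walk the buffer from oldest to newest, i.e. in insertion order.
  find() {
    const size = this.slots.length;
    const start = (this.tail - this.count + size) % size;
    return Array.from({ length: this.count },
      (_, i) => this.slots[(start + i) % size]);
  }
}

const log = new ToyCappedCollection(3);
for (let i = 1; i <= 5; i++) log.insert({ n: i });
const remaining = log.find().map((doc) => doc.n);

console.log(remaining);
```

Inserting never searches for free space, and reading in insertion order is a single pass over the slots.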
When to use DBRefs
In short, the best times to use DBRefs are when you’re storing heterogeneous references
to documents in different collections, like in the previous example or when you want
to take advantage of some additional DBRef-specific functionality in a driver or tool.
Otherwise, it’s generally best to just store an "_id" and use that as a reference, because
that representation tends to be more compact and easier to work with.
Administration
Backing up from a slave is recommended
Backing up from a slave is the recommended way to handle data backups with MongoDB.
What happens when “repairing a database”
The underlying process of repairing a database is actually pretty easy to understand: all of the documents in the database are exported and
then immediately imported, ignoring any that are invalid. After that is complete, all
indexes are rebuilt. Understanding this mechanism explains some of the properties of
repair. It can take a long time for large data sets, because all of the data is validated and
all indexes are rebuilt. Repairing can also leave a database with fewer documents than
it had before the corruption originally occurred, because any corrupt documents are
simply ignored.
Repairing a database will also perform a compaction. Any extra free
space (which might exist after dropping large collections or removing
large numbers of documents, for example) will be reclaimed after a
repair.
Replication
Replica Sets vs Master Slave
A replica set is basically a master-slave cluster with automatic failover. The biggest
difference between a master-slave cluster and a replica set is that a replica set does not
have a single master: one is elected by the cluster and may change to another node if
the current master goes down. However, they look very similar: a replica set always has
a single master node (called a primary) and one or more slaves (called secondaries).
The nice thing about replica sets is how automatic everything is. First, the set itself does
a lot of the administration for you, promoting slaves automatically and making sure
you won’t run into inconsistencies.
Read Scaling
Scaling out reads with slaves is easy: just set up master-slave replication like usual, and
make connections directly to the slave servers to handle queries. The only trick is that
there is a special query option to tell a slave server that it is allowed to handle a query.
(By default, queries will not be executed on a slave.) This option is called slaveOkay,
and all MongoDB drivers provide a mechanism for setting it. Some drivers also provide
facilities to automate the process of distributing queries to slaves, though this varies on a
per-driver basis.
Using Slaves for Data Processing
Another interesting technique is to use slaves as a mechanism for offloading intensive
processing or aggregation to avoid degrading performance on the master. To do this,
start a normal slave, but with the addition of the --master command-line argument.
Starting with both --slave and --master may seem like a bit of a paradox. What it
means, however, is that you’ll be able to write to the slave, query on it like usual, and
basically treat it like you would a normal MongoDB master node. In addition, the slave
will continue to replicate data from the actual master. This way, you can perform
blocking operations on the slave without ever affecting the performance of the master
node.
When using this technique, you should be sure never to write to any
database on the slave that is being replicated from the master. The slave
will not revert any such writes in order to properly mirror the master.
The slave should also not have any of the databases that are being
replicated when it first starts up. If it does, those databases will not ever
be fully synced but will just update with new operations.