Better Bulking for MongoDB 2.6 & Beyond

TL;DR: Be careful using the new Bulk API

There are two new features in MongoDB 2.6 that offer potentially huge performance improvements: the new write protocol and the Bulk operations support in the Mongo shell (and other drivers). The two features are mentioned together in the release notes, but that close association could lead to confusion and disappointment unless you understand how these new features work. Allow us to explain...

The Promise

The new write protocol in MongoDB 2.6 allows clients to send batches of commands to the server, which executes them and returns a document detailing how those operations went: what was processed and, if something went wrong, what it was and where processing stopped. It takes MongoDB 2.4's bulk insert operation and extends it to all write operations. Meanwhile, in the Mongo shell – and in the updated drivers that accompanied MongoDB 2.6's release – the new Bulk() API appears to reflect this new capability, with examples such as this:

var bulk = db.items.initializeOrderedBulkOp();  
bulk.insert( { items: "abc123", status: "A", defaultQty: 500, points: 5 } );  
bulk.insert( { items: "ijk123", status: "A", defaultQty: 100, points: 10 } );  
bulk.find( { status: "D" } ).removeOne();  
bulk.find( { status: "D" } ).update( { $set: { points: 0 } } );  
bulk.execute();  

from the Bulk.execute() reference page. This appears to show how an ordered bulk operation can be built that does two inserts, removes a document and then updates another and, once built, is executed, presumably by the server. The same API exists for the Node.js driver, and with an API that looks like this, the first thing any developer would want to do is start bulking up their database write operations to get the benefit of fewer round trips to the MongoDB server.

The Actuality

There's only one catch – it doesn't work like that. If we go back to the new write protocol, we can see that there are three classes of write operation – insert, update and delete – and each has its own command in the write protocol. What's new is that update and delete have now joined insert in being able to carry multiple operations of the same class. So it is possible to bulk execute multiple inserts, multiple updates or multiple deletes. There's also an expanded results document returned by these operations which gives a richer picture of the success or failure of a particular bulk operation. But currently, and most importantly, there's no support for mixed bulk operations in the protocol – getting there would require a much more extensive reworking of the write protocol.
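
To make that concrete, here's a minimal sketch of a 2.6 write command issued by hand (the collection name and queries are our own illustration): one command, many operations, all of the same class.

// A hand-built 2.6 update command: the "updates" array can carry many
// operations, but they are all updates. There's no way to slip an
// insert or a delete into this document.
db.runCommand( {
    update: "items",
    updates: [
        { q: { status: "D" }, u: { $set: { points: 0 } }, multi: true },
        { q: { status: "P" }, u: { $set: { points: 1 } }, multi: true }
    ],
    ordered: true
} );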

So what MongoDB has done is look forward to a time when mixed bulk operations will be possible, design an API that could use them, and then write code in the library that maps that expectation onto the current, or past, protocol realities. It's a transitional compromise which makes sense, but it is important to know what happens behind the scenes to get the best performance and avoid disappointment.

The Impact

With the Bulk API in the Mongo shell and the Node.js driver, as each operation is added, if it is of the same type as the previous operation, it is added to a batch of operations of that type. Otherwise, it is used to start a new batch. When execute() is called, these batches are run in the order they were created.
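
You can see this batching for yourself in the shell: Bulk.getOperations(), callable after execute(), returns the batches that were actually sent. A quick sketch, using made-up documents:

var bulk = db.items.initializeOrderedBulkOp();
bulk.insert( { item: "a" } );
bulk.insert( { item: "b" } );            // same class: joins the insert batch
bulk.find( { item: "a" } ).removeOne();  // class changes: starts a remove batch
bulk.execute();
// Inspect what was sent; batchType is 1 for insert, 2 for update, 3 for remove.
// Expect two batches here: insert (2 ops) and remove (1 op).
bulk.getOperations().forEach( function( batch ) {
    print( "batchType: " + batch.batchType + ", operations: " + batch.operations.length );
} );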


Even if an unordered Bulk() operation is selected, these batches are still run in creation order; the ordered flag is simply passed down to the particular batch commands being executed. This means that if you write an unordered bulk operation now and it works well, make sure you are not relying on that implied ordering: a later release could run the whole set of operations in a genuinely unordered fashion to exploit parallelism. If in doubt, it may be better to play safe and stick with ordered bulk operations. Note that ordered/unordered does control how errors are handled: ordered stops at the first error it finds, while unordered processes all the operations and returns a list of the errors, the operations concerned, and the documents that caused them.
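
As a quick sketch of the unordered behaviour (the documents are our own), a failed execute() in the shell throws, and the exception carries details of every operation that failed:

var bulk = db.items.initializeUnorderedBulkOp();
bulk.insert( { _id: 1 } );
bulk.insert( { _id: 1 } );  // duplicate key: this operation will fail
bulk.insert( { _id: 2 } );  // unordered: still attempted despite the failure
try {
    bulk.execute();
} catch ( e ) {
    // The thrown error lists each failed operation and why it failed
    printjson( e );
}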

Taking that MongoDB example from earlier, the four operations make three batches: an insert batch with two operations in it, a remove batch with one, and an update batch with one, cutting the number of round trips to the server from four to three. If one of the insert operations were swapped in order with the remove operation, we'd be back to four round trips, as four batches would be created.
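
To illustrate, here's that reshuffled version (purely illustrative, using the same documents as before):

var bulk = db.items.initializeOrderedBulkOp();
bulk.insert( { items: "abc123", status: "A", defaultQty: 500, points: 5 } );
bulk.find( { status: "D" } ).removeOne();  // class changes: remove batch
bulk.insert( { items: "ijk123", status: "A", defaultQty: 100, points: 10 } );  // class changes again: a second insert batch
bulk.find( { status: "D" } ).update( { $set: { points: 0 } } );  // class changes again: update batch
bulk.execute();  // four batches, four round trips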

Similar caveats apply to all the official MongoDB drivers and their support for the new write protocol. For example, in the Ruby driver the operations are held as a single collection and broken up into batches during execution, while the C driver creates batches as part of its command-writing process. The principle is the same though: each bulk operations API has been created for a future where mixed bulk operations will work, but on MongoDB 2.6 and its underlying write protocol, using it well means knowing that commands will be batched up by class.

The Win

Before you get the feeling there aren't any gains to be found with the Bulk API, there are. As long as you stick to adding large batches of the same class of operation, your code will map cleanly onto the underlying write protocol. In a synthetic micro-benchmark, updating 1,000 records without the Bulk API we saw around 30 operations per second, with most of that time taken up by the "send update/request last error" loop. With the Bulk API we got an estimated 4,000 operations per second (and with 100,000 records being updated, around 5,000 operations per second). That's quite a boost for large numbers of operations, as long as the class of operation is the same. Remember, you can mix operations, but take care to avoid generating, say, a pattern of insert/update/insert/update... as this will nullify all the performance gains. A rule of thumb: use the Bulk API's more attractive semantics, but only to execute operations of the same class. That way, you will be assured of good performance.
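
As a rough sketch of the two shapes being compared (the field names and counts here are ours, not the actual benchmark code):

// Without the Bulk API: one round trip per document
for ( var i = 0; i < 1000; i++ ) {
    db.items.update( { _id: i }, { $set: { points: i } } );
}

// With the Bulk API: all the updates are the same class, so they travel
// together (the 2.6 protocol caps a single batch at 1,000 operations;
// larger runs are split transparently)
var bulk = db.items.initializeOrderedBulkOp();
for ( var i = 0; i < 1000; i++ ) {
    bulk.find( { _id: i } ).updateOne( { $set: { points: i } } );
}
bulk.execute();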

This support exists in all the official MongoDB drivers and it does work with older servers, but a word of warning from the documentation bears repeating: the API is designed for version 2.6 servers and later. Against earlier MongoDB servers it works in a compatibility mode, issuing each operation one at a time, which can be very slow with all the round tripping. There are some optimisations outside the Bulk API which can get your performance up, and we'll cover those in a later blog post.