Arrays and Replication: A MongoDB Performance Pitfall

In this Write Stuff article, Gigi Sayfan tells the tale of a performance problem with MongoDB. It's one of those problems that hides really well and comes out in production...

When you design a system that deals with a lot of data, you need to understand how your database operates at a pretty low level and map that understanding to your design decisions. It's only through working in the trenches that you grok which aspects of a database really matter. There are a lot of MongoDB war stories out there. Mine tells how the design of MongoDB's replication and the oplog impacts your ingestion rates and data modeling.

The system in question received a constant stream of data from 50 devices every second. When I joined the team, the system was maxed out and couldn't handle all the incoming data. We had to scale it 10X within several months. MongoDB was our database, and we ran the latest version, which was 2.2 at the time. The configuration was a replica set with one primary and two secondaries.

Before I continue with the story, let's take a quick detour and talk briefly about some relevant MongoDB concepts and its architecture. MongoDB is, of course, a distributed database. It distributes the data automatically for you with a mechanism called replication. The nodes of a Mongo cluster are organized in replica sets. Each replica set consists of a single primary and one or more secondaries. All writes go to the primary. The secondaries replicate asynchronously from the primary and can satisfy reads. There is a lot of complex machinery involved with failover, elections and different configurations of members. You can read all about it here: MongoDB Replication (version 2.2)
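
For concreteness, a replica set like the one in this story could be initiated from the mongo shell roughly like this (the hostnames here are hypothetical):

    > rs.initiate({
    ...     _id: "rs0",
    ...     members: [
    ...         { _id: 0, host: "mongo1:27017" },
    ...         { _id: 1, host: "mongo2:27017" },
    ...         { _id: 2, host: "mongo3:27017" }
    ...     ]
    ... })

One member is then elected primary and the other two become secondaries.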

Right now, I want to focus only on the replication process itself. MongoDB stores data in collections, which contain documents. There is no schema. Each document is an arbitrary JSON object stored in a binary format called BSON.

MongoDB also stores all its metadata as documents in the 'system.*' collections of each database. Every operation that modifies data on the primary is stored in a special capped collection called the oplog (short for operations log). The oplog is replicated asynchronously to all the secondaries, which replay the operations to become eventually consistent with the primary. One interesting property of the oplog is that its operations are idempotent: the same operation can be applied multiple times and the result will be as if it was applied just once. This will play a key role later, so keep it in mind.
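
The oplog is just another collection, so you can peek at it from the shell. It lives in the 'local' database; an entry looks roughly like this (the exact fields vary between versions, so treat this output as illustrative):

    > use local
    > db.oplog.rs.find().sort({$natural: -1}).limit(1)
    { "ts" : Timestamp(1364810461, 1), "h" : NumberLong("2352182250225989177"),
      "op" : "i", "ns" : "test.test", "o" : { "_id" : ObjectId("..."), "value" : 1 } }

Here 'op' is the operation type ('i' for insert, 'u' for update, 'd' for delete), 'ns' is the namespace, and 'o' is the operation document itself.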

A replica set provides redundancy and allows scaling of read operations by adding more secondaries to handle reads. It also provides overall availability and automatic failover: if the primary becomes unavailable, a sophisticated leader-election mechanism promotes one of the secondaries to primary.

Note that if you want to scale writes, MongoDB provides sharding, where multiple replica sets collaborate and provide yet another level of complexity. Today, we will stay with a single replica set.

Back to the story. After some head scratching, reading, and checking the status of the replica set, it became clear there was a problem with replication. The replication lag just kept growing and the secondaries couldn't keep up with the primary. In our case, this was a huge problem. Our devices keep sending data every second. They don't stop. There is no downtime during which the oplog can be drained. If replication falls behind, it stays behind and the replication lag just keeps growing. Since the oplog is a capped collection, once the lag exceeds the capacity of the oplog (by default the larger of 5% of free disk space or 1GB) the oldest operations are discarded and the secondaries can never catch up. Increasing the size of the oplog just postpones the inevitable.
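
Two shell helpers are useful for diagnosing this kind of trouble: db.printReplicationInfo() shows the oplog size and the time window it covers, and rs.printSlaveReplicationInfo() shows how far each secondary lags behind. The output below is illustrative; when the lag exceeds the oplog window, the secondaries are in trouble:

    > db.printReplicationInfo()
    configured oplog size:   1024MB
    log length start to end: 744secs (0.21hrs)
    oplog first event time:  Mon Apr 01 2013 10:00:01 GMT+0000
    oplog last event time:   Mon Apr 01 2013 10:12:25 GMT+0000
    now:                     Mon Apr 01 2013 10:12:26 GMT+0000
    > rs.printSlaveReplicationInfo()
    source: mongo2:27017
        syncedTo: Mon Apr 01 2013 09:58:02 GMT+0000
        864 secs (0.24 hrs) behind the primary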

That was the point at which I got involved. I was pretty sure we were doing something wrong, because our hardware was decent and the primary had no problem whatsoever handling the data from the 50 devices, so why would the secondaries have such a hard time replicating the same data?

Enter array append and replication semantics. There is surprisingly little information out there about the interaction between update operations and replication. Let's take a detour and explore the way Mongo does array updates. As you know, Mongo can store arbitrarily complex data in JSON documents. The documents can represent nested data structures with dictionaries, arrays and primitive values. When you update a document, you can either completely replace it with a new document, like so:

    > c = db.test
    > c.insert({name: 'the name', value: 1});
    > c.find()
    { "_id" : ObjectId("5636494168dfac3864bdabf6"), "name" : "the name", "value" : 1 }
    > c.update({name: "the name"}, {name: "the new name", value: 3});
    > c.find()
    { "_id" : ObjectId("5636494168dfac3864bdabf6"), "name" : "the new name", "value" : 3 }    

Or you can update just a particular field using a variety of update operators.

For example, to set the value of a single field you can use the '$set' operator:

    > c.update({name: "the new name"}, {$set: {value: 444}});
    > c.find()
    { "_id" : ObjectId("5636494168dfac3864bdabf6"), "name" : "the new name", "value" : 444 }

This is pretty straightforward: if you just want to update one field in a big document, there is no need to replace the entire document. MongoDB also supports arrays. If you just want to append one item to a big array, you can use the '$push' operator:

    > c.remove({})
    > c.insert({array: [1,2,3]})
    > c.find()
    { "_id" : ObjectId("563658f0218f63ee3e16e185"), "array" : [ 1, 2, 3 ] }
    > c.update({}, {$push: { array: 4}})
    > c.find()
    { "_id" : ObjectId("563658f0218f63ee3e16e185"), "array" : [ 1, 2, 3, 4 ] }

But here is where the plot thickens. Remember that the oplog is idempotent? Now consider $set versus $push. If you set a field to a value, setting it to the same value multiple times produces the same result, so $set is idempotent. But if you push a value onto an array multiple times, a new element is appended every time and the array keeps growing, so $push is not idempotent. How does MongoDB handle this situation? It turns out that it converts every $push operation into a $set operation in the oplog.

When the array above contains the items [1, 2, 3], a $push of 4 is converted to a $set of [1, 2, 3, 4] in the oplog. The bottom line is that whenever you push something onto an array, the entire array is replaced as far as replication is concerned.
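
You can watch the conversion happen by running the $push above and then looking at the newest update entry in the oplog. On a replica set it looks roughly like this (illustrative output):

    > use local
    > db.oplog.rs.find({ns: "test.test", op: "u"}).sort({$natural: -1}).limit(1)
    { "ts" : Timestamp(1364810462, 1), "op" : "u", "ns" : "test.test",
      "o2" : { "_id" : ObjectId("563658f0218f63ee3e16e185") },
      "o" : { "$set" : { "array" : [ 1, 2, 3, 4 ] } } }

Note that the 'o' field contains the entire array, not just the appended item.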

Back to the story. Our data model used hourly documents, where each document contained an array of 3600 items, one for each second. Whenever a new second of data (about 100 bytes) arrived from a device, it was pushed into the current hourly document. But since MongoDB converted every $push to a $set, in practice we replaced the whole array every second. The array was on average 1800 items long (it starts every hour at 0 items and fills up to 3600 by the end of the hour). That means that instead of replicating about 100 bytes every second, we actually replicated on average 1800 x 100 bytes every second for each of the 50 devices. That unexpected 1800X blow-up of the replication load was too much for our poor replica set.
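
To make the data model concrete, the per-second ingestion looked roughly like this (the collection and field names here are made up for illustration):

    > db.hourly.update(
    ...     { device: "device-17", hour: ISODate("2013-04-01T10:00:00Z") },
    ...     { $push: { samples: { t: 2725, data: "..." } } },  // ~100 bytes of readings
    ...     { upsert: true }
    ... )

Every such update was replicated as a $set of the whole 'samples' array.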

Once the problem was clear, I started evaluating options. One option was to keep an hour of data in memory and, once an hour, write a single hourly document that would be replicated as a single document of 3600 x 100 bytes. That solves the replication problem but has serious downsides: if the primary crashes, we lose on average half an hour of data. We could persist the buffered data locally on the primary, so it would be lost only if the primary's disk died, but even then we would lose the realtime aspect, since the secondaries would have to wait an hour to see the latest data.

Another solution was to simply write a separate document every second. This addresses both the replication issue and keeps the secondaries up to date. But there are performance downsides to creating documents that each hold such a small amount of data: every document carries its own overhead, and since our query patterns typically required at least an hour of data, the ability to fetch an hour's worth of data cheaply was a property I wanted to preserve.

Eventually, I settled on a hybrid solution: one-second documents were created and replicated as they arrived, but every hour the documents from the most recent hour were compacted into a single hourly document and the one-second documents were removed.
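
A minimal sketch of that hourly compaction step, assuming hypothetical 'seconds' and 'hourly' collections, might look like this:

    > var start = ISODate("2013-04-01T10:00:00Z"), end = ISODate("2013-04-01T11:00:00Z");
    > var samples = db.seconds.find({ device: "device-17",
    ...     ts: { $gte: start, $lt: end } }).sort({ ts: 1 }).toArray();
    > // a single insert replicates as one oplog entry instead of 3600 array rewrites
    > db.hourly.insert({ device: "device-17", hour: start,
    ...     samples: samples.map(function (s) { return s.data; }) });
    > db.seconds.remove({ device: "device-17", ts: { $gte: start, $lt: end } });

The one-second documents stay queryable in realtime, while longer-range queries read the compacted hourly documents.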

Due to various other fixed overheads, the total saving came out at 440X less data replicated over the wire, which eventually let us scale to 500 devices on the same cluster.

About a month after this adventure took place (May 2013), someone created a MongoDB Jira issue for this behavior: https://jira.mongodb.org/browse/SERVER-9784. It is still unresolved and its status is "planned but not scheduled".

The takeaway is that if you deal with a lot of data, your data modeling should be informed by a good understanding of how your database works. That lesson is even more important with distributed databases, where each operation might mean an expensive round trip across the wire.

Gigi Sayfan is the director of software infrastructure at Aclima (http://aclima.io), a start-up company that designs and deploys distributed sensor networks that enable a higher level of environmental awareness. Gigi has been developing software professionally for 20 years in domains as diverse as instant messaging, morphing, chip fabrication process control, embedded multi-media applications for game consoles, brain-inspired machine learning, custom browser development, web services for a 3D distributed game platform and, most recently, IoT/sensors.

This article is licensed with CC-BY-NC-SA 4.0 by Compose.