Separating Collections to Improve MongoDB Measurability

There's a common implementation pattern among MongoDB users – keep everything, all our collections, in one database. It's an odd thing to do, though. If we had the ability to join between Mongo collections, this would make sense, but we can't. If we could create relational constraints on our data, it would make sense, but we can't. And we can't create transactions or atomic operations across collections, so why keep them in the same space?

We know what MongoDB is capable of, so why not embrace its nature? We don't even need to keep these collections in the same process, so why don't we spread things out and give our data collections room to breathe?

Breathe steady

If you ever try to measure the behavior of your collections in an "all-in-one" database, you'll find it a bit of an uphill battle. All the different usage patterns intermingle, and any metrics you think you are gathering are muddied by the often invisible hand of other queries running at the same time. Your assumptions about response times are upended as a set of report queries builds a new document. The predictions you made about how quickly your reports will be generated are thrown to the wind as hundreds of clients update their session data.

It's as if there are multiple orchestras in the same room all playing different works. You may be able to clearly hear one when all the others pause, but that doesn't happen often. What we need to do is give each orchestra its own auditorium so we can listen to them better. For collections, that means a database each.

Compose powered

When you get a MongoDB database with Compose, you actually get a Compose MongoDB deployment – a cluster of separately containerized systems supporting multiple databases. But those databases all live within the same process and use the same resources, so separating your data into them isn't going to help... it'll be more like putting a rope guard rail between the various orchestras.

What you need is a completely separate deployment, and a database on that deployment. Once isolated into separate deployments and databases, your collections will be more measurable and unable to interfere with each other. You will need to rework your application's code to make use of the multiple databases, but most modern Mongo drivers make it easy to plug into multiple databases simultaneously. As a minimal sketch, assuming the official Node.js MongoDB driver ("mongodb" on npm) and using placeholder URIs, database names, and collection names, that might look like this:
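// A minimal sketch, assuming the official Node.js MongoDB driver.
// The URIs, database names, and collection names are placeholders
// for your own.
const { MongoClient } = require("mongodb");

async function main() {
  // One client per deployment: session data and reporting data now
  // live on completely separate Compose deployments.
  const sessions = new MongoClient("mongodb://u:p@sessions-host:port/sessions_db");
  const reports = new MongoClient("mongodb://u:p@reports-host:port/reports_db");
  await sessions.connect();
  await reports.connect();

  // Each collection is reached through its own deployment's client,
  // so a heavy report query can't get in the way of session writes.
  const sessionData = sessions.db("sessions_db").collection("sessions");
  const reportData = reports.db("reports_db").collection("reports");

  console.log(await sessionData.countDocuments());
  console.log(await reportData.countDocuments());

  await sessions.close();
  await reports.close();
}

main().catch(console.error);

The question, then, is how do you get your data from one database into the isolated databases?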

Enter The Transporter

Transporter is an ETL (Extract Transform Load) tool for people who don't like ETLs. Transporter lets you pull in data from one source, in this case your original everything-in-one database, transform it in some way (or not, as is the case here), and then write it to a destination database. We developed Transporter in 2014 as a replacement for Mongo's "copyDatabase" command, which is about as successful as a groundhog at predicting the weather.

MongoDB wasn't alone in having the problem of unreliable copying; other database technologies had similar problems. What we did was create a database abstraction and adapt it to each database to handle the reading and writing operations. We then made it possible to construct a pipeline of those adapters and added a "transformer" class to let users modify the content in flight between the source and the destination. We won't be transforming anything here, but for a flavor of it, a transformer is a JavaScript file whose exported function sees each document on its way through the pipeline. Here's a sketch of an identity transformer; treat the exact signature as an assumption and check the Transporter documentation for your release:
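// Illustrative only: an identity transformer that passes each
// document through unchanged. The function signature here is an
// assumption; consult the Transporter docs for your release.
module.exports = function(doc) {
  // Inspect or modify the document here; returning it unchanged
  // sends it straight on to the destination.
  return doc;
}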

Preparing the ground

You can create a new deployment from the Compose dashboard; simply click "Add Deployment", select the kind of database you are provisioning (MongoDB, in this case), and choose where you want it deployed. Ideally, that should be the same location as the source database.

Now you'll need to get the Transporter application. You could, as Transporter is open source, build your own binary from the code at https://github.com/compose/transporter. If you are not so inclined, though, the latest binary releases of Transporter, for Ubuntu Linux and Mac OS X, are available on the releases page.

Copying a single collection

With Transporter installed, you'll need to create a file called config.yml which contains something like this:

nodes:
  source:
    type: mongo                              # the MongoDB adapter
    uri: mongodb://u:p@host:port/source_db   # user, password, and host of the original database
    namespace: source_db.collection_name     # database.collection to read from
    debug: true                              # verbose logging, useful while testing
  dest:
    type: mongo
    uri: mongodb://u:p@host:port/dest_db     # the new, isolated deployment
    namespace: dest_db.new_collection_name   # database.collection to write to
    debug: true

Change the URI and namespace entries to the ones for your source and destination databases and collections. Then, run:

transporter eval --config config.yml 'Source({name:"source"}).save({name:"dest"})'

That copies the named collection from the source database to the destination. And you're done, at least for the simplest case where the source database is at rest.

If you want to copy another collection, you can either change the namespace entries in the config.yml file, or you can override them on the command line. Any parameter in the config file can be overridden, so...

transporter eval --config config.yml 'Source({name:"source",namespace:"source_db.other_collection"}).save({name:"dest",namespace:"dest_db.new_other_collection"})'

Data in motion

The previous incantations of Transporter work well for data at rest. If anything changes in the database during the copy, though, those changes may well be lost. If your data is being updated, or you have to coordinate switching your application over, you ideally want Transporter to keep updating the destination database. For that, you can use Transporter's tail function, which follows changes in MongoDB's oplog. Before you use it, though, you'll need to create a user on the source database with oplog permissions and use that user (and password) in the source uri value in config.yml.
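The oplog lives in MongoDB's local database, so what's needed is a user that can read local. Here's a sketch using the mongo shell; the user name and password are placeholders, and the details may differ in your setup (Compose's dashboard, for instance, may offer its own way to grant oplog access):

use admin
db.createUser({
  user: "oplogger",                          // placeholder user name
  pwd: "secret",                             // placeholder password
  roles: [ { role: "read", db: "local" } ]   // the oplog is local.oplog.rs
})

With that user in place, run Transporter with the tail option set to true: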

transporter eval --config config.yml 'Source({name:"source",tail:true}).save({name:"dest"})'

And it'll keep running, replicating all the inserts, updates, and deletes on the specified collection into the new database. When you are ready to switch over, just point your application at the new, separate database and it should be right up to date.

More measurability

Once your collections are separated, you can start using the Compose tools to identify the actual loads on those collections. Each one will be isolated from any interference by other queries or updates, and you'll be able to do more effective analysis of your data's activity.

You can isolate the amount of time a particular collection or set of collections spends locked. You may even find some performance improvement from isolating particular collections in their own database deployments, though it's best to consider that a temporary respite in which to get your analysis and optimization done.

This isn't, it must be said, a course of action for small databases, but it is ideal for users with many gigabytes of data. The Compose advantage is that those extra database deployments are just a button click away.