Transporter, MongoDB and synchronization


Transporter was developed originally to synchronize MongoDB databases. In this article, we'll look at how to configure the latest generation of Transporter to do just that.

In past articles, we've looked at how data can flow from database to disk and disk to database, and introduced how we can manipulate that flow of data with Transformers. Now, I want to look at moving some data between databases. The simplest useful connection is MongoDB to MongoDB.

Consider, as an example, that we want to move a collection from one MongoDB deployment to another. It's a useful trick when you want to isolate the impact of different workloads on the database, but if the database is that busy, you need to be able to make the move as transparently as possible. That's not a problem with the Transporter to hand. Let's get a configuration file going:

$ transporter init mongodb mongodb
Writing pipeline.js...  

Now we have a pipeline.js file, which starts out with a source and a sink for MongoDB databases. In this case, I'm going to move our enron mail example database - people are doing some complex queries on it, and it's interfering with production, so it needs to go somewhere else. I'll be moving it from one deployment, the source, into another, the sink, in a different datacenter. Both databases have SSL enabled, so I have to allow for that too. I'm going to jump ahead and show you the pipeline.js file that does this, then explain what it all means.

var source = mongodb({  
    "uri": "${MONGODB_SOURCE_URI}",
    "ssl": true,
    "cacerts": ["${MONGODB_SOURCE_CERT}"]
});

var sink = mongodb({  
    "uri": "${MONGODB_SINK_URI}",
    "ssl": true,
    "cacerts": ["${MONGODB_SINK_CERT}"],
    "bulk": true
});

t.Source("source", source, "/^enron$/").Save("sink", sink, "/^enron$/");  

Right, this is a little different from the generated pipeline.js: I've added SOURCE or SINK to the environment variable names for the URIs, and rather than hardwire the paths to the certificate files in the cacerts arrays, I've made them environment variables too, using that ${ENVNAME} embedded-variable syntax. It's worth remembering that you can map any value in a Transporter pipeline to an environment variable; it makes pipelines much more reusable and easier to deploy in containers, where you may only be able to pass environment variables.

Now we need to create the information to populate those environment variables. We'll go to each of the database deployments and create users called "transporteruser" with a password "transporterpass" (so you can see them clearly). You can use whatever users and passwords you've configured. On the admin pages for the databases, you'll find the URIs for them. Remember to clip off the ?ssl=true from the end of them. Then we download the self-signed SSL certificates for each and save them as source.pem and sink.pem.
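If you're creating those users yourself, the mongo shell's db.createUser command does the job. A minimal sketch follows; the readWrite role on the sink and read on the source are my assumptions here, and you should grant whatever your own security policy dictates:

```javascript
// Run in the mongo shell against the source deployment.
// A read-only role is enough for the source; the sink needs readWrite.
use enron
db.createUser({
  user: "transporteruser",
  pwd: "transporterpass",
  roles: [ { role: "read", db: "enron" } ]
})
```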

With all that information, I make a file which I can source into my shell.

export MONGODB_SOURCE_URI=mongodb://,  
export MONGODB_SOURCE_CERT=./source.pem  
export MONGODB_SINK_URI=mongodb://,  
export MONGODB_SINK_CERT=./sink.pem  

You should be able to see how these variables substitute into the pipeline.js file when it runs.
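The observable effect is the same as a simple ${NAME} string substitution. Here's a rough JavaScript illustration of that behaviour - this is not Transporter's actual implementation, just a sketch of what you can expect to happen to the values:

```javascript
// Rough illustration of ${ENVNAME} substitution -- not Transporter's
// actual code, just the observable behaviour of the feature.
function substituteEnv(value, env) {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, function (match, name) {
    // Leave the placeholder untouched if the variable isn't set.
    return name in env ? env[name] : match;
  });
}

// Example: resolving the sink URI from pipeline.js (hypothetical host).
var uri = substituteEnv("${MONGODB_SINK_URI}", {
  MONGODB_SINK_URI: "mongodb://example-host:27017/enron"
});
```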

There are two properties that are explicitly set, though. One is self-explanatory: "ssl": true enables SSL support, replacing the ?ssl=true we clipped off the URIs. The other is less obvious: "bulk": true is only enabled on the sink, and it turns bulk writes on. Does this make a difference? Yes, a huge one; with non-bulk writes, the Transporter has to make a round trip to the sink database for every document transported. With bulk enabled, many documents are bundled up into one trip. It's especially noticeable if you are running Transporter from a location geographically remote from your sink - the further away you are, the longer it takes without bulk.
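To see why batching matters, imagine each round trip costing a fixed network latency; grouping documents into batches divides that cost by the batch size. A back-of-the-envelope sketch (the document count, batch size and latency figures are made up for illustration, and this has nothing to do with Transporter's internals):

```javascript
// Toy model: round trips needed to write `docs` documents when each
// network call carries at most `batchSize` of them.
function roundTrips(docs, batchSize) {
  return Math.ceil(docs / batchSize);
}

// 100,000 documents at 50 ms of latency per round trip:
var latencyMs = 50;
var oneByOne = roundTrips(100000, 1) * latencyMs;    // 5,000,000 ms - over 80 minutes
var bulked   = roundTrips(100000, 1000) * latencyMs; // 5,000 ms - 5 seconds
```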

The only other change we've made is to ensure that only the enron collection is copied. That's done by setting the namespace in the actual pipeline:

t.Source("source", source, "/^enron$/").Save("sink", sink, "/^enron$/");  

A brief reminder here that the namespace parameter is a regular expression and the "/^" and "$/" are there to ensure that only the word "enron" gets through. We're all done so a quick
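Since the namespace is a regular expression, it's worth checking what it does and doesn't match; plain JavaScript regular expressions behave the same way:

```javascript
// The anchored pattern matches the collection named exactly "enron"...
var ns = /^enron$/;
ns.test("enron");          // true
// ...but not collections that merely contain the word:
ns.test("enron_archive");  // false
ns.test("old_enron");      // false
// Without the anchors, both of those would slip through:
/enron/.test("enron_archive"); // true
```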

$ transporter run

And the Transporter starts copying the collection... and when it's done, it stops running. That's fine for this collection because it's historical data. But what if the collection held live data; how would we copy it consistently? The MongoDB adaptor has an option to tail the oplog, MongoDB's replication log, which lets programs see changes in real time. Turn the option on, and when the initial copy has finished, the Transporter stays running, listening to the oplog and creating new messages containing documents with all the changes. All you need to add to the source properties is "tail": true.
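With that option added, the source block of our pipeline.js would look something like this; the rest of the file stays exactly as before:

```javascript
var source = mongodb({
    "uri": "${MONGODB_SOURCE_URI}",
    "ssl": true,
    "cacerts": ["${MONGODB_SOURCE_CERT}"],
    "tail": true
});
```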

There is a complication though. The oplog is somewhat protected. If you're running your own MongoDB database then you'll want to create a user capable of reading the local database which is where the oplog collection lives. If you are on older Compose MongoDB Classic, just create a user with the oplog privilege.

If you are on current Compose MongoDB, you'll need to turn on the Oplog Add-on, which handles getting the oplog from a sharded database. That will offer you an oploguser, password and URI. You'll need to edit the URI, removing the &ssl=true from the end and replacing local with the name of your database, which in my case is enron. Use that URI as MONGODB_SOURCE_URI in your environment variables. I'll also download the SSL certificate from the add-on as oplogsource.pem and set that as MONGODB_SOURCE_CERT.

Set the Transporter running with these changes and once the copy is complete, you'll be able to insert, delete and edit the source collection and see the changes appear in the sink collection.

With this information in hand, you can now copy and synchronize collections between MongoDB databases. This being the Transporter, you can also modify the data using Transformers, allowing you to create new document structures rather than just clones. Be aware, though, that when you are tailing the oplog, you'll get different message types representing the changes to records; don't assume they'll all be inserts, as they are when you copy the database without tailing. We'll look at this in more detail in a future article on Transformers and how to use them effectively.

attribution Marina Vitale

Dj Walker-Morgan
Dj Walker-Morgan was Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets.
