Transporter Driving: 2 - From DB to DB

In part one of this series, we showed how you could use the open source Transporter from Compose to move and transform data between files and databases in both directions. The next stage is to start connecting databases to each other and in the process we'll introduce the various adapters and their capabilities. We'll start with ...

From local MongoDB to Compose MongoDB

Your data is held in a local MongoDB and you want to migrate it to a remote MongoDB – like say one of Compose's fine auto-scaling MongoDB deployments. In this case, this is going to be a one shot process, like the "Import Database" function in the Compose dashboard. Why do this? Well, you may want to do an import but your source database isn't publicly visible on the internet.

In the last part we showed how an adapter node can be a source of records or a sink for records. The logical step is, for this, to define both the source and the sink as MongoDB adapters. Here's the config.yaml file for that:

nodes:  
  infile:
    type: mongo
    uri: mongodb://localhost/test
  outfile:
    type: mongo
    uri: mongodb://username:password@host:port/database

We can now create a pipeline to join these two together in a mongotomongo.js Transporter script:

 pipeline=Source({name:"infile",namespace:"test.names"})
      .save({name:"outfile",namespace:"database.names"})

The namespace parameter specifies the database name and collection name that should be read from or written to. It's worth knowing that the config.yaml file can provide defaults for all the parameters, so if we wanted to we could move those namespace settings out to the config file and simplify the application like so:

nodes:  
  infile:
    type: mongo
    uri: mongodb://localhost/test
    namespace: test.names
  outfile:
    type: mongo
    uri: mongodb://username:password@host:port/database
    namespace: database.names

Which would leave the application as the shorter:

 pipeline=Source({name:"infile"}).save({name:"outfile"})

It's up to you whether you make a variable easily accessible to change in config.yaml or hard-coded to a particular value in the application script. That's copied everything over to the Compose database but what if we want to keep both databases synchronised.

Keeping in sync

The previous example runs through all the records in the named collection and transfers them over. Then the Transporter stops, its work done. But what if you need it to keep a collection synchronized between two servers. Well, in the case of the MongoDB adapter there's an tail option which if set to true will, when the copy has completed, switch over to reading the MongoDB oplog for changes and send over those changes too. All you need to do is add that option to the application script or the config file like so:

 nodes:
  infile:
    type: mongo
    uri: mongodb://localhost/test
    namespace: test.names
    tail: true
...

This will only work if your MongoDB is generating an oplog to be tailed and oplogs are only created when your database is being replicated as part of a replica set. If you have a simple standalone instance of MongoDB, it won't be being replicated, there'll be no oplog and the Transporter won't be able to detect the changes. To convert a standalone MongoDB to a single node replica set you'll need to start the server with --replSet rs0 (rs0 is just a name for the set) and when running, log in with the Mongo shell and run rs.initiate() to get the server to configure itself.

Now we can run the Transporter and if we make a change to our local database it will be reflected in the other.

If you are curious you can listen in on what's changed and being copied. Here's a very simple transformer, listen.js which actually does nothing but log the documents being passed to it:

module.exports = function(doc) {  
  console.log("transformer: " + JSON.stringify(doc));
  return doc;
}

Save that as listen.js and then include that in the pipeline:

pipeline=Source({name:"infile")  
      .transform({filename:"listen.js"})
      .save({name:"outfile"})

And run the transporter again. Assuming you've emptied the destination collection, the first thing we'll see all the records being copied. Then, if you log into your local MongoDB instance with the Mongo shell and make some changes to the data, you'll see each of those changes being passed over.

Practical Transporting

Now you might be wanting to look at an exercise which leverages all these components. Our next step here will be to step through loading up a database from a JSON file and then transferring and transforming the contents of that database over to another database. Let's start with a JSON file first. It just so happens that there's a twenty five thousand record example file in the MongoDB Getting Started documentation. The short version is download the file at https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/dataset.json and save it as a file locally. You'll probably want to save it in its own directory as we'll be creating some more files in a moment.

Over on Compose, I've gone and created a MongoDB deployment with a database called dataset where we'll keep this data. Now, our next stop is where we create the config.yaml file. Here it is:

nodes:  
  infile:
    type: file
    uri: dataset.json
  outfile:
    type: mongo
    uri: mongodb://<user>:<pass>@candidate.19.mongolayer.com:11045,candidate.18.mongolayer.com:11222/dataset?replicaSet=set-5559c365165876fe50000e7d
    namespace: dataset.dataset

The infile node points at our local file and the outfile points at the MongoDB database. Remember to change that MongoDB URI to point to your own MongoDB database. Copy over that listen.js file we created in the previous section and save that in the directory too. You can even reuse that pipeline; save this:

pipeline = Source({ name: "infile" })  
  .transform({ filename: "listen.js" })
  .save({ name: "outfile" });

as filetomongo.js. Now you can run transporter run filetomongo.js and the JSON file should begin importing into the database and echoing each record onto the console as it does it. If there's any latency between you and the database then this will not go that quickly. That's because for each record, the Transporter is making a full round trip to the database. There's a way to speed that up though. Add bulk: true to the outfile node in config.yaml and it'll use bulk writes when writing to the database; it will tear through the records in a fraction of the time.

Now we have the data in one database, it's time to move some of that data into another database, after we've stripped it down. Here's an example record in our database now:

Now we've been asked to strip the data down in this new database so it's just the name, restaurant_id, zipcode and the most recent grade and when it was graded. We can also copy over the database _id field so that we can synchronise between the two data sets. We'll write a transformer for those changes now. First, we'll take on the easy part, the values we just need to copy over:

module.exports = function(doc) {  
  newdoc={};
  newdoc._id=doc._id;
  newdoc.name=doc.name;
  newdoc.restaurant_id=doc.restaurant_id
  newdoc.zipcode=doc.address.zipcode;

Simple enough. To get the most recent grade is going to take a little more work. We want to work through the array of grades and for each entry check if its the highest so far. Well, we can do that with plain JavaScript like so:

  newdoc.grade="";
  newdoc.gradedate=0;
  doc.grades.forEach(function(grade) {
    if(grade.date.$date>newdoc.gradedate) {
      newdoc.grade=grade.grade;
      newdoc.gradedate=grade.date;
    }
  });
  return newdoc;
}

Save that as reorg.js. One thing you may notice is we refer to grade.date.$date. The grade.date is an ISODate object but for simple comparisons we need the simple long value for the date and we can get that from the .$date method. If this is serialized as JSON, you'll find the date is expressed as date: { $date: longvalue } and if you look in the original dataset.json file, thats what you'll find there. Aren't date representations fun!

Anyway, now we'll need to add some nodes to the config.yaml too. Here infile2 is basically the same as outfile and outfile2 points to a different database on Compose also with a database called dataset:

  infile2:
    type: mongo
    uri: mongodb://<user>:<pass>@candidate.19.mongolayer.com:11045,candidate.18.mongolayer.com:11222/dataset?replicaSet=set-5559c365165876fe50000e7d
    namespace: dataset.dataset
  outfile2:
    type: mongo
    uri: mongodb://<user>:<pass>@lamppost.11.mongolayer.com:10164,lamppost.10.mongolayer.com:10177/dataset?replicaSet=set-5582bb5e6a666ff7e9000920
    namespace: dataset.dataset
    bulk: true

Now we need set up a pipeline called mongotomongo.js that uses those nodes and our reorg.js transformer:

pipeline = Source({ name: "infile2" })  
  .transform({ filename: "reorg.js" })
  .save({ name: "outfile2" });

We can run this with transporter run mongotomongo.js and when it's finished we'll find records that look like this:

If we want to keep the databases in sync, as before, all we need to do is add a tail:true to the infile2 and, as we're using Compose, ensure the user we are using to access the source database has the "oplog access" privilege. Inserts and updates will be passed through the transformer and deletions will remove the content of the record, but currently not the id placeholder.

Before we end, the function we used to find the most recent grade could also have been written like this:

  if (_.isEmpty(doc.grades)) {
    newdoc.grade = "";
    newdoc.gradedate = 0;
  } else {
    maxgrade = _.max(doc.grades, function(grade) {
      return grade.date.$date;
    })
    newdoc.grade = maxgrade.grade;
    newdoc.gradedate = maxgrade.date;
  }

This makes use of the Underscore.js library which is automatically included with the Transporters JavaScript environment. Although it's a little more verbose in this case, the library offers a wide range of array, list and map manipulation functions which can be very useful when transforming JSON records.

Parking up

We've shown you how to connect databases and how to use the Transporter to sync collections between them. In the next part, we'll look at the pipeline in more detail and what else we can attach to it. Drive (your Transport) Safely!