Transporter Driving: 2 - From DB to DB
In part one of this series, we showed how you could use the open source Transporter from Compose to move and transform data between files and databases in both directions. The next stage is to start connecting databases to each other, and in the process we'll introduce the various adapters and their capabilities. We'll start with ...
From local MongoDB to Compose MongoDB
Your data is held in a local MongoDB and you want to migrate it to a remote MongoDB – like, say, one of Compose's fine auto-scaling MongoDB deployments. In this case, it's going to be a one-shot process, like the "Import Database" function in the Compose dashboard. Why do this? Well, you may want to do an import but your source database isn't publicly visible on the internet.
In the last part we showed how an adapter node can be a source of records or a sink for records. The logical step here is to define both the source and the sink as MongoDB adapters. Here's the config.yaml file for that:
nodes:
  infile:
    type: mongo
    uri: mongodb://localhost/test
  outfile:
    type: mongo
    uri: mongodb://username:password@host:port/database
We can now create a pipeline to join these two together in a mongotomongo.js Transporter script:
pipeline = Source({ name: "infile", namespace: "test.names" })
  .save({ name: "outfile", namespace: "database.names" })
The namespace parameter specifies the database name and collection name that should be read from or written to. It's worth knowing that the config.yaml file can provide defaults for all the parameters, so if we wanted to we could move those namespace settings out to the config file and simplify the application like so:
nodes:
  infile:
    type: mongo
    uri: mongodb://localhost/test
    namespace: test.names
  outfile:
    type: mongo
    uri: mongodb://username:password@host:port/database
    namespace: database.names
Which would leave the application as the shorter:
pipeline = Source({ name: "infile" }).save({ name: "outfile" })
It's up to you whether you make a value easily accessible to change in config.yaml or hard-code it to a particular value in the application script. That copies everything over to the Compose database, but what if we want to keep both databases synchronized?
Keeping in sync
The previous example runs through all the records in the named collection and transfers them over. Then the Transporter stops, its work done. But what if you need it to keep a collection synchronized between two servers? Well, in the case of the MongoDB adapter there's a tail option which, if set to true, will switch over to reading the MongoDB oplog once the copy has completed and send those changes over too. All you need to do is add that option to the application script or the config file like so:
nodes:
  infile:
    type: mongo
    uri: mongodb://localhost/test
    namespace: test.names
    tail: true
  ...
This will only work if your MongoDB is generating an oplog to be tailed, and oplogs are only created when your database is being replicated as part of a replica set. If you have a simple standalone instance of MongoDB, it isn't replicated, so there's no oplog and the Transporter won't be able to detect the changes. To convert a standalone MongoDB into a single-node replica set, you'll need to start the server with --replSet rs0 (rs0 is just a name for the set) and, once it's running, log in with the Mongo shell and run rs.initiate() to get the server to configure itself.
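As a rough sketch, that conversion looks something like this (the data directory path here is just an example; adjust it to match your own installation):

# start mongod as a single-node replica set
mongod --dbpath /data/db --replSet rs0

# then, from another terminal, open the Mongo shell and initiate the set
mongo
> rs.initiate()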
Now we can run the Transporter and if we make a change to our local database it will be reflected in the other.
If you are curious, you can listen in on what's changed and is being copied. Here's a very simple transformer, listen.js, which does nothing but log the documents being passed to it:
// listen.js - a pass-through transformer that logs every document it sees
module.exports = function(doc) {
  console.log("transformer: " + JSON.stringify(doc));
  return doc;
}
Save that as listen.js and then include that in the pipeline:
pipeline = Source({ name: "infile" })
  .transform({ filename: "listen.js" })
  .save({ name: "outfile" })
And run the Transporter again. Assuming you've emptied the destination collection, the first thing you'll see is all the records being copied. Then, if you log into your local MongoDB instance with the Mongo shell and make some changes to the data, you'll see each of those changes being passed over.
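For example, with the source namespace set to test.names as above, an insert like this in the Mongo shell (the field names here are just made up for illustration) should be logged by listen.js and appear in the destination collection moments later:

use test
db.names.insert({ firstname: "Ada", lastname: "Lovelace" })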
Practical Transporting
Now you might want to try an exercise which brings all these components together. We'll walk through loading up a database from a JSON file and then transferring and transforming the contents of that database over to another database. Let's start with the JSON file. It just so happens that there's a twenty-five-thousand-record example file in the MongoDB Getting Started documentation. The short version: download the file at https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/dataset.json and save it locally. You'll probably want to save it in its own directory as we'll be creating some more files in a moment.
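If you prefer to do that from the command line, something like this will do it (the directory name is just a suggestion):

mkdir transporter-demo && cd transporter-demo
curl -O https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/dataset.json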
Over on Compose, I've created a MongoDB deployment with a database called dataset where we'll keep this data. Our next stop is to create the config.yaml file. Here it is:
nodes:
  infile:
    type: file
    uri: dataset.json
  outfile:
    type: mongo
    uri: mongodb://<user>:<pass>@candidate.19.mongolayer.com:11045,candidate.18.mongolayer.com:11222/dataset?replicaSet=set-5559c365165876fe50000e7d
    namespace: dataset.dataset
The infile node points at our local file and the outfile node points at the MongoDB database. Remember to change that MongoDB URI to point to your own MongoDB database. Copy over the listen.js file we created in the previous section and save that in the directory too. You can even reuse that pipeline; save this:
pipeline = Source({ name: "infile" })
.transform({ filename: "listen.js" })
.save({ name: "outfile" });
as filetomongo.js. Now you can run transporter run filetomongo.js and the JSON file should begin importing into the database, echoing each record onto the console as it goes. If there's any latency between you and the database, this won't go that quickly, because for each record the Transporter is making a full round trip to the database. There's a way to speed that up though. Add bulk: true to the outfile node in config.yaml and it'll use bulk writes when writing to the database; it will tear through the records in a fraction of the time.
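With that option added, the outfile node in config.yaml looks like this:

  outfile:
    type: mongo
    uri: mongodb://<user>:<pass>@candidate.19.mongolayer.com:11045,candidate.18.mongolayer.com:11222/dataset?replicaSet=set-5559c365165876fe50000e7d
    namespace: dataset.dataset
    bulk: true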
Now that we have the data in one database, it's time to move some of that data into another database, after we've stripped it down. Here's an example record in our database now.
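The document below is representative rather than an exact dump (the values are illustrative), but every record in the restaurants dataset has this shape:

{
  "_id": ObjectId("..."),
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445",
  "borough": "Bronx",
  "cuisine": "Bakery",
  "address": {
    "building": "1007",
    "street": "Morris Park Ave",
    "zipcode": "10462",
    "coord": [-73.856077, 40.848447]
  },
  "grades": [
    { "date": ISODate("2014-03-03T00:00:00Z"), "grade": "A", "score": 2 },
    { "date": ISODate("2013-09-11T00:00:00Z"), "grade": "A", "score": 6 }
  ]
}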
We've been asked to strip the data down in this new database so it's just the name, restaurant_id, zipcode, the most recent grade, and when that grade was given. We'll also copy over the database _id field so that we can synchronize the two data sets. We'll write a transformer for those changes now. First, we'll take on the easy part, the values we just need to copy over:
// reorg.js - strip each restaurant document down to a handful of fields
module.exports = function(doc) {
  var newdoc = {};
  newdoc._id = doc._id;
  newdoc.name = doc.name;
  newdoc.restaurant_id = doc.restaurant_id;
  newdoc.zipcode = doc.address.zipcode;
Simple enough. Getting the most recent grade is going to take a little more work. We want to work through the array of grades and, for each entry, check whether it's the most recent so far. We can do that with plain JavaScript like so:
newdoc.grade="";
newdoc.gradedate=0;
doc.grades.forEach(function(grade) {
if(grade.date.$date>newdoc.gradedate) {
newdoc.grade=grade.grade;
newdoc.gradedate=grade.date;
}
});
return newdoc;
}
Save that as reorg.js. One thing you may notice is that we refer to grade.date.$date. The grade.date is an ISODate object, but for simple comparisons we need the plain long value of the date, and we can get that from the .$date field. If this is serialized as JSON, you'll find the date is expressed as date: { $date: longvalue }, and if you look in the original dataset.json file, that's what you'll find there. Aren't date representations fun!
Anyway, now we'll need to add some nodes to the config.yaml too. Here infile2 is basically the same as outfile, and outfile2 points to a different database on Compose, also with a database called dataset:
  infile2:
    type: mongo
    uri: mongodb://<user>:<pass>@candidate.19.mongolayer.com:11045,candidate.18.mongolayer.com:11222/dataset?replicaSet=set-5559c365165876fe50000e7d
    namespace: dataset.dataset
  outfile2:
    type: mongo
    uri: mongodb://<user>:<pass>@lamppost.11.mongolayer.com:10164,lamppost.10.mongolayer.com:10177/dataset?replicaSet=set-5582bb5e6a666ff7e9000920
    namespace: dataset.dataset
    bulk: true
Now we need to set up a pipeline called mongotomongo.js that uses those nodes and our reorg.js transformer:
pipeline = Source({ name: "infile2" })
.transform({ filename: "reorg.js" })
.save({ name: "outfile2" });
We can run this with transporter run mongotomongo.js and when it's finished we'll find records that look like this:
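Sticking with the illustrative values from earlier, a transformed record contains just the fields reorg.js copied over:

{
  "_id": ObjectId("..."),
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445",
  "zipcode": "10462",
  "grade": "A",
  "gradedate": ISODate("2014-03-03T00:00:00Z")
}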
If we want to keep the databases in sync, as before, all we need to do is add tail: true to the infile2 node (shown below) and, as we're using Compose, ensure the user we're using to access the source database has the "oplog access" privilege. Inserts and updates will be passed through the transformer; deletions will remove the content of the record, but currently not the id placeholder.
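For reference, that makes the infile2 node in config.yaml look like this:

  infile2:
    type: mongo
    uri: mongodb://<user>:<pass>@candidate.19.mongolayer.com:11045,candidate.18.mongolayer.com:11222/dataset?replicaSet=set-5559c365165876fe50000e7d
    namespace: dataset.dataset
    tail: true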
Before we end, the function we used to find the most recent grade could also have been written like this:
if (_.isEmpty(doc.grades)) {
  newdoc.grade = "";
  newdoc.gradedate = 0;
} else {
  var maxgrade = _.max(doc.grades, function(grade) {
    return grade.date.$date;
  });
  newdoc.grade = maxgrade.grade;
  newdoc.gradedate = maxgrade.date;
}
This makes use of the Underscore.js library, which is automatically included in the Transporter's JavaScript environment. Although it's a little more verbose in this case, the library offers a wide range of array, list and map manipulation functions which can be very useful when transforming JSON records.
Parking up
We've shown you how to connect databases and how to use the Transporter to sync collections between them. In the next part, we'll look at the pipeline in more detail and what else we can attach to it. Drive (your Transport) Safely!