Transporter Driving - Part One


So, you want to drive the new Transporter? In this article we'll show you how. Before we go on, though, note that Transporter is under active development; we've taken care to cover only the relatively stable parts of the user experience.

First get your Transporter

As Transporter is open source, you could build your own binary by downloading the code from https://github.com/compose/transporter. If you're not so inclined, the latest binary releases of the Transporter, for Ubuntu Linux and Mac OS X, are available on the releases page.

A look under the hood

If we opened the Transporter's hood, we'd find the simplest Transporter has an engine that looks like this:

[Diagram: the simplest Transporter pipeline, with a Source feeding a Sink]

At the core is the pipeline, and everything works by attaching things to it. The things we attach are nodes, and they are the components in a rather versatile plumbing kit. We start with nodes for reading and writing to databases, plain text files or whatever else; they can be used as the Source (input) or the Sink (output) of the pipeline.

A Source node is designed to take data from whichever source has been selected and pump messages into the pipeline. A message is a wrapper around a JSON/BSON document created by the Source node. The source adds an operation code, insert, update or delete, which tells the eventual recipient what it wants done with the attached document. That message goes down the pipeline to the Sink node.

The Sink node is the opposite of the Source node, designed to take in messages from the pipeline and write them out into whatever output has been selected for them. It uses the message's operation code added by the Source node to decide whether it should insert, update or delete using the message's attached document.
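
Conceptually, then, a message pairs an operation with a document. Sketched as JSON (purely illustrative; these field names are not Transporter's internal format), an insert message might look like:

{
  "op": "insert",
  "document": { "id": 1, "email": "someone@example.com" }
}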

Each Sink and Source is associated with an adapter which can read or write to a particular database or file. It's these associations that we get to define first. Let's create the simplest of Transporters, the file-to-file transporter, and show how this is done.

Creating nodes

For any Transporter, there are at least two parts: the config.yaml file, where we define the nodes, and a JavaScript file, where we connect the nodes. Here's our file-to-file config.yaml:

nodes:  
  infile:
    type: file
    uri: file:///tmp/foo
  outfile:
    type: file
    uri: file:///tmp/foo2

In the nodes: section we can create nodes with any name. Here we make two, infile and outfile. For each of these, we then define an adapter type, in this case file. The file adapter needs only one other value, the URI of the file it will be reading from or writing to. We specify that with the uri: key and a fully specified URI for the file. The file adapter can also take stdout:// as a URI, in which case output goes to standard out (there's a sketch of such a node after the listing below). Save that file as config.yaml. To make sure it's readable as a set of nodes we can run transporter list and we should get back:

Name                 Type            URI  
infile               file            file:///tmp/foo  
outfile              file            file:///tmp/foo2  
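
As an aside, a node that writes to standard output rather than a file would look something like this (a sketch; "console" is just a node name of our own choosing):

console:
  type: file
  uri: stdout://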

We have configured our nodes. Before we move on, it's worth noting that the file adapter reads multiple JSON formatted records from the input file. We've put some mock data in a JSON file, file-to-file/testfile, available to download with all these examples. Take that file and copy it to /tmp/foo (a one-line cp, shown below), and we're ready to do some plumbing.
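
Assuming the example files are unpacked in your working directory, that copy is just:

$ cp file-to-file/testfile /tmp/foo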

Connecting the nodes

We can call our connecting JavaScript file anything we like. Let's keep it simple and call it filetofile.js. In there we can put our code...

pipeline=Source({name:"infile"}).save({name:"outfile"})  

Short and sweet. pipeline is just the name of a variable; the Source() function returns a new pipeline instance to be stored in it. Pipelines all start with a source, so where better to start the pipeline than by specifying the Source. The .save() function attaches a sink to the pipeline, where our records are destined, and it also returns the pipeline instance. You could also write this code as:

pipeline=Source({name:"infile"});  
pipeline.save({name:"outfile"})  

But chaining the functions together makes for an easier read. Going further, you can even drop the pipeline= part to leave:

Source({name:"infile"}).save({name:"outfile"})  

You only need the variable if you want to add more nodes or build more complex pipelines. What counts is that the pipeline is created; once created, it's ready to run. Anyway, that's our plumbing done. Now we can check the connections with the transporter test command:

$ transporter test filetofile.js

TransporterApplication:  
 - Source:         infile                  file               file:///tmp/foo
  - Sink:          outfile                 file               file:///tmp/foo2

This shows us the file Source "infile" feeds into the file Sink "outfile". All looks fine, so let's run our script:

$ transporter run filetofile.js
$

As with all good commands, silence can be taken to mean success. If you want to see what has happened, compare the contents of /tmp/foo and /tmp/foo2 and you'll find the fields have been rearranged into alphabetic key order as they've been processed.
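
For example, taking the first record of the test data, the copy should come out with the same fields sorted into alphabetical key order, something like:

$ head -1 /tmp/foo
{"id":1,"first_name":"Angela","last_name":"Castillo","email":"acastillo0@digg.com","country":"Russia","ip_address":"16.247.173.93"}
$ head -1 /tmp/foo2
{"country":"Russia","email":"acastillo0@digg.com","first_name":"Angela","id":1,"ip_address":"16.247.173.93","last_name":"Castillo"}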

Introducing Transformers

There's another kind of node you can attach to a pipeline: Transformer nodes, which can run JavaScript code.

[Diagram: a Transporter pipeline with a Transformer node between the Source and the Sink]

The Transformer node takes in messages, hands them to a block of JavaScript code for processing and, depending on the return values from that block of code, passes the message on down the pipeline or disposes of it. The JavaScript code in a Transformer is given a message. All it has to do is return a new message, based on whatever manipulations – removing and adding fields, tweaking values and the like – it wants to do. The returned message is passed down the pipeline to, in this case, the Sink node.

Adding a Transformer

In our testfile data, we only want to retain some fields: id, first name, last name and email address. We need to create a Transformer file first. We'll call it justnameemail.js and it looks like this:

module.exports = function(doc) { return _.pick(doc, ["id", "first_name", "last_name", "email"]) }  

This one line sets a function to be exported with module.exports=function. That function takes a document (doc), and it's through that variable that the message, in the form of a JavaScript map, is passed to the function. The function itself returns the result of calling the Underscore library function pick, which extracts the named fields from the message map and makes a new map from them. If you want to know more about what you can do in the Transformer JavaScript, the current implementation is very similar to the Compose production Transformer which you can read about in Powering up your data transfer.
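
Because the return value decides what happens to the message next, a transformer can also drop records entirely. A variation on our script might look like this sketch (it assumes, as with the Compose production Transformer, that returning a falsy value disposes of the message; check that behaviour in your build before relying on it):

module.exports = function(doc) {
  // Assumption: returning false drops the message instead of passing it on
  if (!doc.email) return false;
  // Otherwise trim the document down to the fields we care about
  return _.pick(doc, ["id", "first_name", "last_name", "email"]);
}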

With our transformation defined, we now need to plug it into the pipeline. Back to our Transporter JavaScript file, which we've renamed filetransformer.js:

pipeline=Source({name:"infile"}).transform({filename:"justnameemail.js"}).save({name:"outfile"})  

We've just inserted the .transform() function before the save and passed it a parameter, filename, which contains the file name of our transformation script. If we test this new JavaScript file we get...

$ transporter test filetransformer.js
TransporterApplication:  
 - Source:         infile                                   file                 file:///tmp/foo
  - Transformer:   db85aafe-cd17-4b61-5dda-2c0d25c67b90     transformer          justnameemail.js
   - Sink:         outfile                                  file                 file:///tmp/foo2

And you can see there's a transformer in the middle now (with a generated name). We can run this script now and, again, on successful completion there'll be silence. But if you compare /tmp/foo and /tmp/foo2...

$ head -3 /tmp/foo
{"id":1,"first_name":"Angela","last_name":"Castillo","email":"acastillo0@digg.com","country":"Russia","ip_address":"16.247.173.93"}
{"id":2,"first_name":"Jimmy","last_name":"Wallace","email":"jwallace1@symantec.com","country":"Sweden","ip_address":"169.248.117.209"}
{"id":3,"first_name":"Keith","last_name":"Patterson","email":"kpatterson2@phpbb.com","country":"Ireland","ip_address":"195.128.151.190"}
$ head -3 /tmp/foo2
{"email":"acastillo0@digg.com","first_name":"Angela","id":1,"last_name":"Castillo"}
{"email":"jwallace1@symantec.com","first_name":"Jimmy","id":2,"last_name":"Wallace"}
{"email":"kpatterson2@phpbb.com","first_name":"Keith","id":3,"last_name":"Patterson"}
$

You'll quickly see which fields have been discarded. We've filtered our data down; the next stop is sending it on to a database.

From file to database

You'll need your MongoDB database URI for this. If your MongoDB is local, then mongodb://localhost/dbname will do. If you have MongoDB on Compose, you can find your database URI on the admin page of the dashboard. We then need to open up the config.yaml file and change the outfile node. For MongoDB, change the node's type to mongo so that it uses the MongoDB adapter, then put your database's URI as the uri value. So for a Compose database called "demonstratum" it would look something like:

outfile:  
  type: mongo
  uri: mongodb://<user>:<password>@lamppost.8.mongolayer.com:10047,lamppost.1.mongolayer.com:10128/demonstratum

Once that's done, it's over to the JavaScript file. In the save() function, we give a node name to look up, but we can also hand over required and optional parameters. One required parameter for the MongoDB adapter is namespace, which combines the database name and collection name for where the records are going to be read from or written to. The namespace parameter can be put in the config.yaml file to help keep the Transporter JavaScript simple; parameters for nodes can be set in the config file and overridden in the JavaScript where need be. For this example, we'll add the namespace parameter in the JavaScript...

pipeline=Source({name:"infile"}).transform({filename:"justnameemail.js"}).save({name:"outfile",namespace:"demonstratum.names"}  

We've saved this example as filetomongodb.js. If we run it with transporter run filetomongodb.js and then go and check the database, we'll find it full of our filtered-down records.
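
As mentioned above, the namespace could instead live in config.yaml to keep the JavaScript shorter. A sketch of the outfile node in that style:

outfile:  
  type: mongo
  uri: mongodb://<user>:<password>@lamppost.8.mongolayer.com:10047,lamppost.1.mongolayer.com:10128/demonstratum
  namespace: demonstratum.names

With that in place, the pipeline's save() call could drop back to just save({name:"outfile"}).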

And from database to file

Having changed the type of the outfile node to mongo, it makes sense to look at the reverse: changing the infile node to mongo instead. We can simply swap the infile and outfile settings around like this:

infile:  
  type: mongo
  uri: mongodb://<user>:<password>@lamppost.8.mongolayer.com:10047,lamppost.1.mongolayer.com:10128/demonstratum
outfile:  
  type: file
  uri: file:///tmp/foo3

Let's simplify the transformer script so it only copies the id and email address:

module.exports = function(doc) { return _.pick(doc, ["id", "email"]) }  

We'll call that file justemail.js. Now all that's left is to construct the pipeline:

pipeline=Source({name:"infile",namespace:"demonstratum.names"}).transform({filename:"justemail.js"}).save({name:"outfile"})  

Apart from the script name change in transform(), all we've done is move the namespace parameter to the source. If we run this with transporter run mongodbtofile.js, it should leave us with a /tmp/foo3 file that looks like this:

{"email":"acastillo0@digg.com","id":1}
{"email":"jwallace1@symantec.com","id":2}
{"email":"kpatterson2@phpbb.com","id":3}
{"email":"jlynch3@apple.com","id":4}
{"email":"plee4@vk.com","id":5}
{"email":"jrogers5@nsw.gov.au","id":6}
...

So far...

We've covered the fundamentals of configuring and running the Transporter, and you should already be equipped to move JSON formatted data around while transforming it. In the next part, we'll look at connecting databases to each other with the Transporter, at how to make the Transporter keep databases in sync, and at some of the other things the Transporter can do to fit in with your infrastructure.

Until next time, safe transporting!