Transporter Transformers - Powering up Your Data Transfer

Since this article was published, we've removed the integrated version of the Transporter from the UI and now direct users to the open source version of Transporter which is more powerful and flexible. (8/2016)

We recently introduced Transporter at Compose, a new component for moving data from MongoDB to Elasticsearch. In its simplest configuration, it's easy to use – you just select the MongoDB database and collection you want to pull documents from and then select the Elasticsearch index and type you want those documents copied to. Then you click Start Transport and a job is created to do your transfer. But there's also an option for you to take more control of your data migration, the Transformation function.

Transporter Transformer Demonstrated from Compose on Vimeo.

By default, the Transporter takes a direct, full copy of the source MongoDB documents and sends them on to the Elasticsearch destination. The key to unlocking the power of the Transporter is a JavaScript interpreter which can be placed in between these input and output processes. Each document is converted into a JavaScript object and then passed to a function in the script. The function takes that document and is expected to return a document that will be passed on to the destination.
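
To illustrate the shape of a transformer function, the simplest possible version is one which just returns each document unchanged, which is effectively what the default full copy does:

module.exports = function(doc){
  // Pass the document through untouched
  return doc;
}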

So, let's assume we have a MongoDB database full of Twitter tweets and we want to perform some analytics on them in Elasticsearch. We go into the dashboard for our Elasticsearch database and select Transporter to bring up the configuration page. Here we can select our MongoDB database (exemplum) and enter the name of the collection we want to work with (tweetstore). Now, we could carry on down the page and fill in the name of the Elasticsearch index and type we want to import the data into, but before we press Start Transport and let the default transport happen, let's go back up the page to the Transformation section and click on Add a transformation function.

The page will change its layout. At the top of the page will be a text area labelled Sample Documents. This will be populated with the first ten records pulled from the named collection. Below that is an editing area which holds a snippet of JavaScript. Below that is another text area, this time showing the result of applying the transformer function to the records shown in the top window. Watch it change as we start to edit the middle window.

Taking control

As we've opened the transformation editor now, the middle window should contain:

module.exports = function(doc){  
  return _.pick(doc, ["_id", "lang", "location", "payload", "topic", "tweet"])
}

This is the auto-generated snippet designed to make it easier to start defining your own function and has already been populated with the field names to copy. The code defines an exports function which takes a JavaScript object representing a document (doc). The function has to build and return a new document based on this incoming document. Those results should be shown in the lower window.

In this case, the transformer uses a function from the automatically-included Underscore.js library to construct that document. Underscore has a whole host of useful manipulation functions for objects and arrays, so it supplements JavaScript's basic functionality well.

The Underscore pick method is a filter which takes an object and an array of keys which should be copied from that object to create a new object. The Transporter user interface has populated that array of keys with the top-level keys of a JSON document, based on the first document sampled from the MongoDB collection. This assumes that the records are fairly regular.
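
To see pick in isolation, here's a small example with a made-up document (the field names and values are purely illustrative):

var doc = { _id: 1, lang: "en", topic: "databases", internal: "x" };
_.pick(doc, ["_id", "lang"]);
// => { _id: 1, lang: "en" }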

Selecting or removing fields

By modifying this key array, you can choose which parts of the document are actually passed to Elasticsearch. For example, if you only wanted the "_id" and "payload" data, you could just use

module.exports = function(doc){  
  return _.pick(doc, ["_id", "payload"])
}

You may want to work the other way around and just remove particular fields. For that, the Underscore omit method is handy as it uses the array of keys as a list of things not to include. Say we want to lose the "lang" and "location" data; we can use

module.exports = function(doc){  
  return _.omit(doc, ["lang", "location"])
}

in the transformer function and that will strip out those two items.

Creating, skipping and counting documents

You may want to make an entirely new document, reorganising the content to suit your destination database. Say we want to create new documents with just the Twitter user's screen name and name as fields. The function would look something like

module.exports = function(doc){  
  var newdoc={};
  newdoc._id=doc._id;
  newdoc._userscreenname=doc.tweet.user.screenname;
  newdoc._name=doc.tweet.user.name;
  return newdoc;
}

We may want to stop some records going to Elasticsearch. For example, say we only wanted tweets which were recorded as being in the English language; anything else we'd want to lose. To drop a record, we simply have to return false from the exports function so...

module.exports = function(doc){  
  if(doc.lang=="en") {
    return _.omit(doc, ["lang"])
  }
  return false;
}

We use omit here to also lose the lang field because we'll already know what language these imported tweets are in.

Finally, say we wanted to omit every fifth tweet. We can use the fact that it's not just the JavaScript function that's compiled, but the entire snippet. That means code outside the function will be run at startup and it lets us do this...

var counter=0;

module.exports = function(doc) {
  counter=counter+1;
  if(counter%5==0) {
    return false;
  }
  return doc;
}

Not a hugely useful example, but it shows you can have global variables that persist over the lifetime of the Transport. You could, more practically, have a map of abbreviations and their expansions as a global and use that to expand particular fields. Or you could have a list of keywords and scan document fields for their presence.
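
As a sketch of that first idea, assuming the documents have a topic field containing abbreviations (the map and the field usage here are purely illustrative):

var abbreviations = {
  "db": "database",
  "es": "Elasticsearch"
};

module.exports = function(doc){
  // Replace an abbreviated topic with its expansion, if we know one
  if(abbreviations[doc.topic]) {
    doc.topic = abbreviations[doc.topic];
  }
  return doc;
}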

Finally, a tip for the current user interface. There isn't, currently, a save function for transformer functions, so you will have to re-enter them if you want to re-run a Transport. We recommend that you copy and paste your functions into a text editor before pressing Start Transport to preserve them.

Update: We've also added a Continuous synchronisation mode to the Transporter which also has Transformer support.