Transporter's new namespace-aware data transfer

With the release of the latest Transporter, version 0.1.0, the authors have made a significant change to how Transporter handles namespaces. The TL;DR version: you can now define the latter part of the namespace as a regular expression, and Transporter makes much more active use of that namespace. Read on for the full details.

Sources and namespaces

The namespace on a source such as the MongoDB adapter comprises the database name and the collection name. The database name is, in MongoDB's case, fixed by the connection URL, so it's only the second part, the collection name, that you can work with. Regular expressions in Transporter are denoted by surrounding them with slash (/) characters.

Say you wanted to transport data from the collections mydb.planes, mydb.trains and mydb.automobiles; you'd use the namespace mydb./(planes|trains|automobiles)/ in the source...

  pipeline = Source({ name: "indb",
      namespace:"mydb./(planes|trains|automobiles)/"  });

Under the hood, there's a whole lot more work going on to ensure all the matching collections are going to be picked up, and each one becomes a message.

We're using a simple regular expression here, and you might assume it would match only the three named collections. But you'd be wrong. If you had a collection called aeroplanes, it would also match the expression and be used to generate Transporter messages. Why?

Regular expressions can be very precise or very liberal, and in this case we've said that any collection name which includes the words trains, planes or automobiles will be copied; aeroplanes contains the word planes. To be precise, we need to say that one of those names should be the only thing between the start and the end of the collection name. Regular expressions let you do that: ^ denotes the start of a line and $ denotes the end. If we anchor the regular expression with these characters, we get the precision we want.

  pipeline = Source({ name: "indb", 
      namespace:"mydb./^(planes|trains|automobiles)$/"  });
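You can check this over-matching behavior in plain JavaScript, whose regex engine agrees with Go's RE2 for this simple syntax (the aeroplanes collection name is just our running example):

```javascript
// Unanchored vs. anchored matching of collection names.
var loose = /(planes|trains|automobiles)/;
var anchored = /^(planes|trains|automobiles)$/;

console.log(loose.test("aeroplanes"));    // true  - "aeroplanes" contains "planes"
console.log(anchored.test("aeroplanes")); // false - anchors reject it
console.log(anchored.test("planes"));     // true  - exact name still matches
```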

It is unlikely, but if your collections or other components of your namespace contain characters such as + or *, which are special characters in regular expressions, you'll probably want to escape them. But those characters are also the key to the power of regular expressions, so...

  pipeline = Source({ name: "indb",
      namespace:"mydb./^(prod|dev)\-[0-9]+$/"  });

This can match any collection named prod or dev followed by a - and then one or more digits, and only digits, up to the end of the collection name. If you want to know more about the particular syntax we're using, it's the Go language's regular expression library, based on RE2 - here's the syntax.
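A quick way to convince yourself of what this pattern accepts is to run it through JavaScript's regex engine, which agrees with RE2 here (the collection names tested below are invented):

```javascript
// The prod/dev pattern from the namespace above.
var pattern = /^(prod|dev)\-[0-9]+$/;

console.log(pattern.test("prod-42"));    // true  - prefix, dash, digits
console.log(pattern.test("dev-7"));      // true
console.log(pattern.test("prod-"));      // false - needs at least one digit
console.log(pattern.test("production")); // false - no dash after the prefix
```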

The namespace support doesn't stop at the source of the data though. It enables a whole new way of arranging your Transporter pipelines.

Matching namespaces

Messages in the Transporter have always carried namespace information with them, but since version 0.4.0 that namespace information has been visible to transformers and sink destinations, and it can play an active part in how you flow data through your Transporter pipeline.

For transformers and sinks, a namespace has to be defined. That namespace definition selects which messages the transformer or sink will be called upon to process. Only the latter part, after the first ., of a transformer's or sink's namespace is significant in filtering. A sink will also make use of the first part of the namespace, as we'll see later.

Let's start with a simple example. If we had a transformer script we only wanted to run on our data from our trains collection we could add...

pipeline.transform({ filename: "trains.js", namespace:"mydb./trains/" });  

Remember, this is not a precise match, so anything with trains in the latter part of the namespace would be processed by the transformer. You can use ^ and $ to tighten the selection to mydb./^trains$/ and ensure only the trains collection is processed.

When the message is processed by the transformer, as well as the data in the record, there's also the ts field (a timestamp for the message), the op field (a string which says which operation this message represents) and the ns field, which contains the message's namespace as set by the source and whatever else is "upstream" of the transformer. We'll see how we can use this later.

In the same way, we can also use regular expressions to match more, if not all, collections. If we wanted all collection data to pass through another transformer, we could do...

pipeline.transform({ filename: "all.js", namespace:"mydb./.*/" });  

If you don't know regular expressions, the . means "any character" and the * means "zero or more occurrences of what precedes me", so .* matches any number of any characters, including none.

The matching also works when writing our data out. So if we want the behavior of previous Transporter versions, then we can use the previous regexp to select all messages.

pipeline.save({ name: "outdb", namespace:"mydb./.*/" });  

You can attach multiple transformers and save sinks to the pipeline, all with different namespace matching filters. For example, you could send messages to one of three different databases...

pipeline.save({ name:"outdb1", namespace: "mydb./planes/" });  
pipeline.save({ name:"outdb2", namespace: "mydb./trains/" });  
pipeline.save({ name:"outdb3", namespace: "mydb./automobiles/" });  

The adaptor that writes the messages out is still responsible for interpreting the namespace. So, for example, the MongoDB adaptor will use the latter part of the message's namespace to select the collection the message will be written to, but it will use the database name from the filtering namespace to select the database to write to. So our example above is more likely to be:

pipeline.save({ name:"outdb1", namespace: "outdb1./planes/" });  
pipeline.save({ name:"outdb2", namespace: "outdb2./trains/" });  
pipeline.save({ name:"outdb3", namespace: "outdb3./automobiles/" });  

... assuming the databases were, of course, outdb1, outdb2 and outdb3.

Changing Namespaces

What if you actually want to change the namespace to which the message is written? The Transporter has you covered there too. If you've followed the Transporter project, you'll know that in the previous release we changed how we packaged messages and made the metadata that travels with your record visible inside transformers. We can now use that extra capability, because part of that metadata is the namespace, stored in the ns field.

Let's say we want to put all our "trains" and "automobiles" records into one "ground" collection and put the "planes" records into a "sky" collection. Let's script a transformer to make that change:

module.exports = function(doc) {  
  // Ground transport goes to one collection...
  if (doc.ns == "mydb.trains" || doc.ns == "mydb.automobiles") {
    doc.ns = "mydb.ground";
  } else if (doc.ns == "mydb.planes") {
    // ...and planes go to another
    doc.ns = "mydb.sky";
  }
  return doc;
}

We can connect this transformer to our pipeline and send it whichever messages we wish; only those from the trains, planes and automobiles collections will have their namespace changed.
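If you want to sanity-check a routing script like this outside Transporter, you can exercise it with plain Node. The function body below is the same as the script above; the sample documents are invented:

```javascript
// Standalone check of the namespace-routing transformer.
var route = function(doc) {
  if (doc.ns == "mydb.trains" || doc.ns == "mydb.automobiles") {
    doc.ns = "mydb.ground";
  } else if (doc.ns == "mydb.planes") {
    doc.ns = "mydb.sky";
  }
  return doc;
};

console.log(route({ ns: "mydb.trains" }).ns); // mydb.ground
console.log(route({ ns: "mydb.planes" }).ns); // mydb.sky
console.log(route({ ns: "mydb.boats" }).ns);  // mydb.boats (unchanged)
```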

Wrapping up

There are some things that are still being worked on in the namespace implementation. With that in mind, note that namespaces defined in the config.yaml file are currently ignored, as the JavaScript parser demands that namespaces are defined in the JavaScript pipeline definition. Old-style namespace definitions ("dbname.collection") will be accepted by the system, but the collection name will be treated as an unanchored string match; if you have lots of similarly named collections, it's best to switch to the /regexp/ format and make your collection specification more precise.

Transporter is an evolving project, and as such we expect to refine how these issues are handled. The developers wanted to get this new namespace handling into users' hands as soon as possible so they could gather feedback quickly, so please check out the namespace filtering and let us know how you get on with it through the Transporter issues page on GitHub.

Photo Source: Stephen Tierney CC-BY-NC-2.0