How to move data with Compose Transporter - From database to disk

Transporter is a great way to move and manipulate data between databases. In this new article, we look at how you can get on board the Transporter quickly.

With the latest 0.2.1 version of Transporter, we've been making our open source tool for moving and manipulating data between databases even easier. In this short series of articles, we are going to show you how to get your data moving with Transporter.

Getting Transporter

You'll want your own copy of Transporter to begin with. You can find binary and source releases at github.com/compose/transporter/releases. Download the appropriate version for your system (the macOS version is -darwin) or, if you prefer to build your own, you can clone the Transporter GitHub repository. You'll probably want to rename the downloaded file to transporter and give it execute permission (chmod u+x transporter) so it can run.
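
For example, on macOS the setup might look something like this (a sketch only; the actual downloaded filename varies by release and platform, so treat transporter-darwin below as a placeholder for whatever you downloaded):

$ mv transporter-darwin transporter    # rename the downloaded binary (placeholder filename)
$ chmod u+x transporter                # give it execute permission
$ ./transporter about                  # confirm it runs by listing the available adaptors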

Transporter quickly

Transporter is based around the idea of building a pipeline. At one end is the source. This brings in data from databases or files and converts it into messages which the pipeline can process. Then the messages flow down the pipeline, passing through filters. In Transporter terms, these filters are called transformers as they are more powerful than a simple filter and can modify the messages. The messages keep flowing downstream until they eventually reach the end of the pipeline and the sinks. Sinks take messages in and send them on to other databases or files.

Let's start with extracting the contents of a database to a file. This is a great way to get a handle on how Transporter and its pipeline are configured.

The new version of Transporter has one addition which makes things much easier. The transporter init command is now the quickest way to get started creating a pipeline between two data sources. Give transporter init the names of two adaptors and it will create the configuration files needed to have one as the source of the data and the other as the destination, the sink, for the data. But where do you find those names?

Information about adaptors is actually built into Transporter. Run transporter about to list the available adaptors.

$ transporter about
file - an adaptor that reads / writes files  
mongodb - a mongodb adaptor that functions as both a source and a sink  
postgres - a postgres adaptor that functions as both a source and a sink  
rethinkdb - a rethinkdb adaptor that functions as both a source and a sink  
transformer - an adaptor that transforms documents using a javascript function  
elasticsearch - an elasticsearch sink adaptor  
$

To create our initial configuration, we can select one as a source adaptor and one as a sink adaptor.

Creating a configuration

Say we wish to move data from MongoDB to a file. For this we can select mongodb as the source adaptor, and file as the sink adaptor. The init command always writes out new configuration files and will overwrite existing files, so be sure to be in a clean or new directory before running it.

$ mkdir transporter-example-1
$ cd transporter-example-1
$ transporter init mongodb file
Writing transporter.yaml...  
Writing pipeline.js...  
$

There are now two files in your current directory, transporter.yaml and pipeline.js. The first file, transporter.yaml, defines the nodes - the source and sink - that the Transporter's pipeline will have available. Here it looks something like this:

nodes:  
  source:
    type: mongodb
    uri: ${MONGODB_URI} 
    # timeout: 30s
    # tail: false
    # ssl: false
    # cacerts: ["/path/to/cert.pem"]
    # wc: 1
    # fsync: false
    # bulk: false
  sink:
    type: file
    uri: stdout://

The nodes: label opens the list of nodes; the children of this will be the names of the nodes. There is no other significance to the names. Here there are two children, source and sink. The source node has a "type" setting of mongodb and the init command has laid out all the available options: uri, timeout, tail, ssl, cacerts, wc, fsync and bulk. Notice that most are commented out with #. The reference page for the adaptor goes into more detail on these; we'll just touch on the ones we need to change.

Setting up the nodes

First up is the uri setting. The uri is the canonical way of describing a connection to a database or similar. It can contain the protocol, host names, ports and more, all in one string. Of course, this isn't something that people want embedded in files. That's why this example uses the ability of the transporter.yaml file to import environment variables.

In this case, the configuration is pulling in the MONGODB_URI environment variable... so we'd better go and set that. We already have a MongoDB deployment on Compose set up, and if we ask the UI for the connection string for the enron1 database we get "mongodb://user:password@host-portal.1.dblayer.com:10000,host-portal.10.dblayer.com:10001/enron1?ssl=true", so let's set that in the environment.

$ export MONGODB_URI="mongodb://user:password@host-portal.1.dblayer.com:10000,host-portal.10.dblayer.com:10001/enron1?ssl=true"

That's actually the only value we need to set for the transporter.yaml file, so why don't we run transporter test at this point. Given a JavaScript .js pipeline file, transporter test will load up everything and test the connections. We'll use the pipeline.js as generated for now:

$ transporter test pipeline.js
Invalid URI (mongodb://user:password@host-portal.1.dblayer.com:10000,host-portal.10.dblayer.com:10001/enron1?ssl=true), unsupported connection URL option: ssl=true  

This is a MongoDB-specific error; the adaptor can take everything from the connection string except the MongoDB options at the end. We have to remove that ?ssl=true from the environment variable.

$ export MONGODB_URI="mongodb://user:password@host-portal.1.dblayer.com:10000,host-portal.10.dblayer.com:10001/enron1"

Then we need to set the equivalent option in the configuration. ?ssl=true was the option that turned on SSL, so we can enable it instead by editing the transporter.yaml file, uncommenting the ssl setting and setting it to true:

nodes:  
  source:
    type: mongodb
    uri: ${MONGODB_URI}
    ssl: true
  sink:
    type: file
    uri: stdout://

We've removed the commented out options for clarity. Now if we run the test:

$ transporter test pipeline.js
TransporterApplication:  
 - Source:         source                                   mongodb         test./.*/                      mongodb://user:password@host-portal.1.dblayer.com:10000,host-portal.10.dblayer.com:10001/enron1
  - Sink:          sink                                     file            test./.*/                      stdout://

This tells us our nodes are connecting to the outside world.

Pipelines and namespaces

We can now look at what we need to set in pipeline.js. It currently looks like this:

Source({name:"source", namespace:"test./.*/"}).save({name:"sink", namespace:"test./.*/"})  

This is JavaScript and the rules of JavaScript apply. Source() creates a source and takes a set of options. There are two essential settings. The first is "name" and its value is the name of the node in the transporter.yaml file to use for configuring the source. Here, it's source. The second essential setting is the namespace.

For the MongoDB adaptor, the namespace is a combination of the database name and a regular expression which should match all the collections we want to read from. The regular expression /.*/ matches anything, so all collections will be read. If you wanted to read from a specific collection, say "stuff" in the test database, the namespace would be "test.stuff".
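
As a quick illustration (using the same generated node names, with stuff as a purely hypothetical collection), a pipeline reading just that one collection would look something like this:

Source({name:"source", namespace:"test.stuff"}).save({name:"sink", namespace:"test./.*/"})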

The .save() takes the message output of the preceding part of the chain and, given its options, sets out to write it somewhere. That somewhere is defined by the "name" option, which again points to the name of a node in the transporter.yaml file. There's also a "namespace" again. The database name on the sink is ignored, but the regular expression has to match for a message to be eligible to be written; here it's set to match anything it sees.

We want to get everything from our enron1 database so all we need to change here is the first namespace, like so:

Source({name:"source", namespace:"enron1./.*/"}).save({name:"sink", namespace:"test./.*/"})  

Write that back to disk and we are now ready to run this pipeline.

$ transporter run pipeline.js
INFO[0001] adaptor Starting...                           path=source  
INFO[0001] boot map[source:mongodb sink:file]            ts=1488378538693138714  
INFO[0001] adaptor Listening...                          file="stdout://"  
INFO[0001] starting Read func                            db=enron1  
INFO[0001] collection count                              db=enron1 num_collections=3  
INFO[0001] sending for iteration...                      collection=enron db=enron1  
INFO[0001] sending for iteration...                      collection=experimental db=enron1  
INFO[0001] iterating...                                  collection=enron  
INFO[0015] Establishing new connection to host-portal.1.dblayer.com:10764 (timeout=15s)...  
INFO[0015] Establishing new connection to host-portal.10.dblayer.com:10361 (timeout=15s)...  
INFO[0015] Connection to host-portal.1.dblayer.com:10000 established.  
INFO[0015] Connection to host-portal.10.dblayer.com:10001 established.  
INFO[0015] Ping for host-portal.1.dblayer.com:10000 is 115 ms  
INFO[0015] Ping for host-portal.10.dblayer.com:10001 is 118 ms  
INFO[0030] Ping for host-portal.1.dblayer.com:10000 is 132 ms  
...

After that point, you'll want to stop that pretty quickly as it will work through your entire database echoing it all out to the console as JSON documents.

Running transporter quietly to a file

What you do see in the snippet above is the tracing. Transporter defaults to being chatty because there are a lot of metrics and pieces of information that can be useful when setting up a Transporter. When you do go to production with Transporter, you can use the -log.level option to select which messages you want to log.

The other issue here is that everything is going to stdout, which is the default for a transporter init generated setup. We just need to change the sink entry like so:

nodes:  
  sink:
    type: file
    uri: file://dump.json

And now our database will be written to the dump.json file in the current directory. Let's run that now and we'll mute the information messages too:

$ transporter run -log.level error pipeline.js
$ ls -lh
total 3005688  
-rw-r--r--  1 dj  staff   1.4G Mar  1 15:29 dump.json
-rw-r--r--  1 dj  staff    92B Feb 28 14:04 pipeline.js
-rw-r--r--  1 dj  staff   122B Mar  1 14:35 transporter.yaml
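
If you want a quick peek at what landed in the file (a minimal check; we're assuming, based on the console output earlier, that the file sink writes one JSON document per line), the first line is enough to see the shape of the data:

$ head -n 1 dump.json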

At this point, we have Transporter extracting data from a MongoDB database and saving it as JSON data. In the next part, we'll look at configuring the Transporter to send data to a database.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

Dj Walker-Morgan
Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading.