Transporter Driving 3: Restoring and Extracting

We've previously shown some examples of how to use the Compose Transporter for moving data around between databases and for uploading files to other databases. Transporter can also be used to transform existing data into more useful formats. For this example, we're going to recreate a task we had to perform to get some test data for a range of databases. The story starts with a MongoDB dump of what's called the Enron Email corpus.

This file, available on the web, contains over 500,000 email items from Enron and has been converted into a number of formats, but please bear in mind the history of the data when using it. The MongoDB edition of the data is the most interesting, as the email headers have been broken out, making it good for searches and analysis. The problem is that it is all encoded as a backup and not as a JSON-formatted file.

Loading the database

Our task is to fix that, and the first stop is to load it into a local database. Let's make a directory and cd into it. Then, we want to download the archive from the site and extract it using tar in our new directory:

$ tar xvjf enron_mongo.tar.bz2
x dump/  
x dump/enron_mail/  
x dump/enron_mail/messages.bson  
x dump/enron_mail/system.indexes.bson  

And now we have some BSON files which will need reconstituting. There was a time when mongorestore could write them straight into a database's data files using its --dbpath option, with no server running, but that option was removed in MongoDB 3.0. The current tool set needs a running database to work with. Let's install MongoDB locally first...

$ brew install mongodb
==> Downloading https://homebrew.bintray.com/bottles/mongodb-3.0.5.yosemite.bott
######################################################################## 100.0%
==> Pouring mongodb-3.0.5.yosemite.bottle.tar.gz
==> Caveats
To have launchd start mongodb at login:  
  ln -sfv /usr/local/opt/mongodb/*.plist ~/Library/LaunchAgents
Then to load mongodb now:  
  launchctl load ~/Library/LaunchAgents/homebrew.mxcl.mongodb.plist
Or, if you don't want/need launchctl, you can just run:  
  mongod --config /usr/local/etc/mongod.conf
==> Summary
🍺  /usr/local/Cellar/mongodb/3.0.5: 17 files, 154M

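OK, we cheated a little here by using Homebrew to install MongoDB. You could, of course, download it from the MongoDB web site and install it that way, but Homebrew on OS X makes things really quick and simple. Anyway, now we need to get MongoDB running, pointed at a data directory inside our working directory (the directory must already exist, so create it first if need be)...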

$ mongod --dbpath data

And you'll want to open a new window because that window has MongoDB running in it. Now, in our new window, and after cd'ing to our working directory, we can run mongorestore:

$ mongorestore
2015-08-26T10:27:00.050-0700    using default 'dump' directory  
2015-08-26T10:27:00.060-0700    building a list of dbs and collections to restore from dump dir  
2015-08-26T10:27:00.060-0700    no metadata file; reading indexes from dump/enron_mail/system.indexes.bson  
2015-08-26T10:27:00.061-0700    restoring enron_mail.messages from file dump/enron_mail/messages.bson  
2015-08-26T10:27:03.060-0700    [####....................]  enron_mail.messages  236.7 MB/1.4 GB  (17.0%)  
2015-08-26T10:27:06.064-0700    [########................]  enron_mail.messages  499.2 MB/1.4 GB  (35.9%)  
2015-08-26T10:27:09.061-0700    [###########.............]  enron_mail.messages  677.9 MB/1.4 GB  (48.7%)  
2015-08-26T10:27:12.061-0700    [###############.........]  enron_mail.messages  918.3 MB/1.4 GB  (66.0%)  
2015-08-26T10:27:15.060-0700    [##################......]  enron_mail.messages  1.0 GB/1.4 GB  (75.6%)  
2015-08-26T10:27:18.062-0700    [#####################...]  enron_mail.messages  1.2 GB/1.4 GB  (90.8%)  
2015-08-26T10:27:19.920-0700    restoring indexes for collection enron_mail.messages from metadata  
2015-08-26T10:27:19.920-0700    finished restoring enron_mail.messages (501513 documents)  
2015-08-26T10:27:19.920-0700    done  
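The document count that mongorestore reports can be cross-checked against the dump file itself. A .bson dump is a run of BSON documents laid end to end, and each document begins with a 4-byte little-endian integer giving its total length, so the documents can be counted by hopping from one length prefix to the next without decoding anything. The sketch below is plain Node.js and is not part of the MongoDB tools; countBsonDocs is a name made up for illustration.

```javascript
const fs = require("fs");

// Count the documents in a mongodump .bson file by following length prefixes.
// Only 4 bytes per document are read, so a multi-gigabyte dump is fine.
function countBsonDocs(path) {
  const fd = fs.openSync(path, "r");
  try {
    const size = fs.fstatSync(fd).size;
    const header = Buffer.alloc(4);
    let offset = 0;
    let count = 0;
    while (offset < size) {
      const bytesRead = fs.readSync(fd, header, 0, 4, offset);
      if (bytesRead < 4) {
        throw new Error("truncated length prefix at byte " + offset);
      }
      // The length covers the whole document, including these 4 bytes.
      const docLength = header.readInt32LE(0);
      // The smallest valid BSON document (an empty one) is 5 bytes long.
      if (docLength < 5 || offset + docLength > size) {
        throw new Error("implausible document length at byte " + offset);
      }
      offset += docLength;
      count += 1;
    }
    return count;
  } finally {
    fs.closeSync(fd);
  }
}
```

Calling countBsonDocs("dump/enron_mail/messages.bson") should agree with the number of documents mongorestore says it restored.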

Mongorestore assumed that our data was in a dump directory and that we were talking to a local MongoDB. The final check is to ensure that the data is all there. We'll just log in and look:

$ mongo enron_mail  
MongoDB shell version: 3.0.5  
connecting to: enron_mail  
> db.getCollectionNames()
[ "messages", "system.indexes" ]
> db.messages.count()
501513  
>

Enter Transporter

OK. We have a database full of data; now to transport that data out. First, we need to install Transporter. You'll want to download a binary release from the GitHub repository. We'll put the binary file for transporter in our working directory for now.

Now, there are two parts to a Transporter configuration: the config.yaml file and the pipeline JavaScript file. The config.yaml defines the broad idea of what nodes will be talking in the transportation process. Here's the one we need:

nodes:  
  localmongo:
    type: mongo
    uri: mongodb:///enron_mail
  outfile:
    type: file
    uri: file://./enron_mail.json

In the node list, it defines a node named "localmongo", which is a MongoDB database (type: mongo) that can be reached by connecting to the URI mongodb:///enron_mail. It also defines a node called outfile which is, unsurprisingly, a file, and gives it a URI equivalent to the file enron_mail.json in the current directory. With this structure defined, we use the pipeline JavaScript to connect them. In this case the file is super-simple and is, in its entirety:

Source({name:"localmongo",namespace:"enron_mail.messages"}).save({name:"outfile"})  

We saved that as export.js. The Source() is given the name of the node to look up in the config.yaml and a namespace to read from. The .save() call takes the output of the Source() and saves it to the node named in its parameters.
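To make the link between the two files concrete, here is a toy model in plain JavaScript. It is not Transporter's code: the config object is a hand-written stand-in for the parsed config.yaml, and missingNodes is a hypothetical helper. All it shows is that every name used in the pipeline has to match a key under nodes.

```javascript
// Hand-written stand-in for the parsed config.yaml (an assumption for
// illustration; Transporter reads the real YAML file itself).
const config = {
  nodes: {
    localmongo: { type: "mongo", uri: "mongodb:///enron_mail" },
    outfile: { type: "file", uri: "file://./enron_mail.json" },
  },
};

// The node names that export.js refers to, in pipeline order.
const pipelineNames = ["localmongo", "outfile"];

// Return the names that have no matching entry under config.nodes.
function missingNodes(config, names) {
  return names.filter(
    (name) => !Object.prototype.hasOwnProperty.call(config.nodes, name)
  );
}

console.log(missingNodes(config, pipelineNames)); // prints []
```

A misspelled name in the pipeline, say "outfil", would come back in the returned array, which is the kind of mismatch worth catching before starting a long export.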

With the two files in place, we can quickly check the configuration with Transporter's test function:

$ ./transporter test export.js
TransporterApplication:  
 - Source:         localmongo                               mongo           enron_mail.messages            mongodb:///enron_mail
  - Sink:          outfile                                  file                                           file://./enron_mail.json
$

Looking good; we can see the source and sink all ready to go. Let's run this then...

$ ./transporter run export.js 
$ ls -la enron_mail.json  
-rw-r--r--  1 dj  staff  1538903052 Aug 27 11:49 enron_mail.json

And if we look in that 1.5GB file, we'll find a lot of mail in JSON format, like this...

{
  "_id": "4f16fc97d1e2d32371003e77",
  "body": "Dear 1-800-FLOWERS.COM customer,\n\nWishing a dear friend or family member a Happy Thanksgiving just got easier!\nComplete their Thanksgiving Feast and send one of our beautiful floral\ncenterpieces or a sentimental gift to say you're thinking of them on\nThanksgiving Day.\nhttp://www.1800flowers.com/cgi-bin/800f/collection.pl/ewbwth/17/0/0/0/0/0\n\nIf you can't be there for the meal, send a delectable gift from our specially\nselected collection of GREATFOOD.COM treats and receive 10%* off.  Be sure to\nuse promotion code THX1 when you place your order to receive your 10%*\ndiscount!\nhttp://www.1800flowers.com/flowers/xt_quick.asp?r=ewbwgf&s=64\n\nVisit us today at http://www.1800flowers.com or find us at AOL keyword:\nflowers.\nFor the holidays and for all your gifting needs, our door is always open to\nhelp you find the perfect gift!\n\nAll the Best,\n\nYour friends at 1-800-FLOWERS.COM\n\nYou received this email because you are a 1-800-FLOWERS.COM customer. If you\nwould no longer like to receive these emails, please send an email to\nremove@1800flowers.com and indicate that you would prefer not to receive\nfurther emails from us about our products. Please do not reply to this\nmessage.\n\nThe 1-800-FLOWERS.COM privacy policy is available online at\nhttp://www.1800flowers.com/flowers/security/index.asp#privacy\n\nFor questions about an order, please email us at custservice@1800flowers.com\n\n*Exclusive of applicable service and shipping, charges and taxes. Items may\nvary and are subject to availability, delivery rules and items.  Offer is\nvalid for online and phone purchases. Offers can not be combined, are\navailable on all products and are subject to restrictions and blackout\nperiods. Offer valid through November 30, 2000. Void where prohibited.\n(c)2000 1-800-FLOWERS.COM, INC.\n",
  "filename": "522.",
  "headers": {
    "Content-Transfer-Encoding": "7bit",
    "Content-Type": "text/plain; charset=us-ascii",
    "Date": "Fri, 10 Nov 2000 10:25:00 -0800 (PST)",
    "From": "flowernews@aol.com",
    "Message-ID": "<5288974.1075854679633.JavaMail.evans@thyme>",
    "Mime-Version": "1.0",
    "Subject": "Complete the Thanksgiving Feast!",
    "X-FileName": "ebass.nsf",
    "X-Folder": "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "X-From": "FlowerNews@aol.com",
    "X-Origin": "Bass-E",
    "X-To": "undisclosed-recipients:, ",
    "X-bcc": "",
    "X-cc": ""
  },
  "mailbox": "bass-e",
  "subFolder": "notes_inbox"
}

Because Enron's mail was full of spam too. Now, you can use whatever tools you fancy to slice (out the spam), dice and import that file... and that could include Transporter, because as well as exporting JSON, it can import it, filter it, reorganize it and send it on to a number of databases. We'll get onto that in a future part.
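As a starting point for that slicing and dicing, here is a minimal Node.js sketch that streams the export and counts messages per mailbox while skipping likely bulk mail. It rests on two assumptions that are ours rather than anything stated above: that the file holds one JSON document per line (the sample above is pretty-printed for reading), and that mail whose X-To header names "undisclosed-recipients" is bulk mail, which is only a rough heuristic. The function names are made up for illustration.

```javascript
const fs = require("fs");
const readline = require("readline");

// Heuristic only (our assumption): mail addressed to
// "undisclosed-recipients" is treated as bulk mail.
function looksLikeBulkMail(message) {
  const xTo = (message.headers && message.headers["X-To"]) || "";
  return xTo.toLowerCase().includes("undisclosed-recipients");
}

// Fold one line of the export into a running tally of messages per mailbox.
function addLine(tally, line) {
  if (line.trim() === "") return tally; // ignore blank lines
  const message = JSON.parse(line);
  if (looksLikeBulkMail(message)) {
    tally.skipped += 1;
  } else {
    const seen = tally.counts.get(message.mailbox) || 0;
    tally.counts.set(message.mailbox, seen + 1);
  }
  return tally;
}

// Stream the file line by line so the 1.5GB export never has to fit in memory.
async function tallyFile(path) {
  const lines = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  const tally = { counts: new Map(), skipped: 0 };
  for await (const line of lines) addLine(tally, line);
  return tally;
}
```

Calling tallyFile("enron_mail.json") returns a promise for the tally: counts is a Map from mailbox name to message count, and skipped is the number of messages the heuristic set aside.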