Use NiFi to Lessen the Friction of Moving Data
Apache NiFi is a powerful data routing and transformation server which connects systems via extensible data flows. All types of data can stream through NiFi's customizable network of processes with real time administration in a web browser.
If data has gravity, as McCrory contends, then data movement has friction proportional to the data's size and the speed of the move. NiFi, a now open source data flow server, was conceived and nurtured inside the NSA to reduce the huge amount of friction from the constant movement of signals intelligence data.
The NSA open sourced NiFi, known internally as Niagara Files, through its technology transfer program in November of 2014, after running it in production for the prior eight years. The NiFi codebase matured in that demanding environment, but the NiFi community is still less than two years old. The original team, who spun out with the project during the transfer, created a commercial company, Onyara, which was quickly purchased by Hortonworks. NiFi is now a top level project of the Apache Software Foundation.
NiFi's history is interesting, but so is its functionality. As a visual data flow server, it provides some pieces, like scheduling tasks and tracking processing steps, which are akin to workflow management a la Airbnb's Airflow or Pinterest's Pinball. It also provides pieces, like buffering data and transforming content, which are more akin to a streaming server a la Kafka or Google's Cloud Dataflow. It is a self-contained server which can scale down to running on a laptop and scale up to a very large cluster of instances. Its well engineered core model is flow based programming centered around a simple abstraction named the FlowFile. It delivers this flow based functionality to end users via a web page, and it uses a couple of optimized data stores to manage the data in a performant way, all with a security focus that some might construe as paranoid.
Flow Based Programming with FlowFiles
NiFi abstracts flow based programming's notion of a message into a slightly more formal structure that is a set of metadata attributes with a pointer to a binary payload:
This is the simplest set of attributes (custom ones can easily be added).
And this is a formatted JSON content payload (a Pokemon tweet).
The payload is just bits as far as NiFi is concerned. These bits could be as small as a JSON message or as large as a multi gigabyte video or anything in between. NiFi doesn't really care. As a FlowFile moves through a flow, NiFi mainly uses the metadata attributes for routing and other decision making. This is an optimization: the payload doesn't have to be read unless it's actually needed.
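To make the attributes-plus-pointer idea concrete, here is a minimal Python sketch. It is not NiFi's API (NiFi itself is written in Java); the FlowFile class, the route function, and the attribute values are all hypothetical stand-ins chosen for illustration. The point is only that a routing decision can be made from metadata without loading the payload.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: metadata attributes plus a lazy handle to the payload."""
    attributes: Dict[str, str]
    load_content: Callable[[], bytes]  # hypothetical loader, called only when bytes are needed
    reads: int = field(default=0)      # counts payload reads, for demonstration only

    def content(self) -> bytes:
        self.reads += 1
        return self.load_content()


def route(flow_file: FlowFile) -> str:
    """Pick a route from attributes alone; the payload is never touched."""
    if flow_file.attributes.get("mime.type") == "application/json":
        return "json"
    return "other"


# Illustrative attribute values, not taken from a real flow.
tweet = FlowFile(
    attributes={"filename": "tweet.json", "mime.type": "application/json"},
    load_content=lambda: b'{"text": "hello"}',
)
destination = route(tweet)
print(destination, tweet.reads)
```

Because the route is chosen from the attributes, the read counter stays at zero until some later step calls content(), which is the same saving the paragraph above describes for large payloads.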
In the flow based model of programming, processing is independent of routing, so each step a FlowFile takes through the flow is separate. In the screenshot of the web UI, the boxes are the Processors. Processors hold configuration and are where the work is actually done when they are running. They can be independently scheduled and represent the extension point for the NiFi platform as a whole. The Processors on the edge tend to "hook up" to external systems: HTTP API endpoints (e.g. Twitter or AWS S3), databases (e.g. Mongo, Cassandra, or SQL), or other TCP services (e.g. IMAP or FTP). Once the edge Processor creates a FlowFile, it begins its journey through the flow. The "blue" Processors in the screenshot represent a flow from one MongoDB to another (this example dedupes ids to brute force a continuous synchronization).
The Connections queue data between Processors. This keeps the Processors uncoupled and also absorbs differences in processing speed or spikes in the quantity of data. Processors can also decide to route a FlowFile to one Connection, to multiple Connections, or to a failure handler. The UI even updates each step's statistics, such as counts and amounts of data processed, in soft real time.
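The queueing and routing behavior can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NiFi code: a Connection is modelled as a plain FIFO queue, and route_on_size, the "small", "large", and "failure" route names, and the size limit are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict

# A Connection is modelled as a plain FIFO queue of attribute dictionaries.
Connection = Deque[dict]


def route_on_size(incoming: Connection, outgoing: Dict[str, Connection], limit: int) -> None:
    """Drain the incoming queue, sending each item to 'small', 'large', or 'failure'."""
    while incoming:
        item = incoming.popleft()
        size = item.get("size")
        if not isinstance(size, int):
            outgoing["failure"].append(item)  # no usable size attribute: failure route
        elif size <= limit:
            outgoing["small"].append(item)
        else:
            outgoing["large"].append(item)


source: Connection = deque([{"id": 1, "size": 10}, {"id": 2, "size": 5000}, {"id": 3}])
routes: Dict[str, Connection] = {"small": deque(), "large": deque(), "failure": deque()}

# While the routing step is not running, the upstream queue simply buffers new arrivals.
source.append({"id": 4, "size": 20})
buffered = len(source)

# Once the step runs, it drains everything that accumulated.
route_on_size(source, routes, limit=1024)
print(buffered, {name: [i["id"] for i in q] for name, q in routes.items()})
```

The step that produces items and the step that routes them share nothing but the queue, which is why one can be stopped, slowed, or replaced without touching the other.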
A Soft Realtime Web UI Built with SVG and D3.js
NiFi's UI is productive. Being able to start and stop Processors and even add new ones to a running data flow is useful. Being able to hook into currently running production flows to split the data into new ones for testing or staging is freeing. Stopping a Processor while the rest of the data flow executes is just fine too. Connections are queues so they will just buffer FlowFiles when there's no running Processor to take data from them. The independence of each of the pieces in the flow based model allows a data flow manager to accommodate many scenarios. One off data dumps, synchronizing to a test or development environment, and even transferring full data stores are all easy, especially when compared to running a set of scripts at a command line.
While native applications for Mac OS X, Windows, or Linux would certainly be able to deliver the utility of NiFi's UI, Scalable Vector Graphics (SVG) with D3.js are more than capable of delivering a rich interactive user interface in the web browser, and that browser based UI is a strong point of NiFi.
Optimized Data Stores
Under the covers the FlowFile abstraction gives way to two data storage approaches tuned to the needs of the data. The content is stored in an append only log, called the Content Repository, on the basis that it should be immutable. The attributes are kept in a key-value store, called the FlowFile Repository, where they can be both rapidly processed and changed or added to as they pass through the system.
By matching these two different use cases of content and metadata to two optimized data stores NiFi removes a great deal of the "friction" from moving data from place to place and system to system.
The FlowFile binds these two implementations together and exposes them to the user in a flow. The user can then optimize even further by injecting some domain knowledge into a flow's design and ensuring that data is processed in whichever manner makes sense.
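The two-store split can be illustrated with a small Python sketch. It is a simplified model, not how NiFi implements its repositories: ContentLog, AttributeStore, and the (offset, length) claim are hypothetical names standing in for the Content Repository, the FlowFile Repository, and the pointer that ties a FlowFile's attributes to its content.

```python
import uuid
from typing import Dict, Tuple

Claim = Tuple[int, int]  # hypothetical pointer into the log: (offset, length)


class ContentLog:
    """Append-only byte log; written bytes are never modified in place."""

    def __init__(self) -> None:
        self._log = bytearray()

    def append(self, payload: bytes) -> Claim:
        offset = len(self._log)
        self._log.extend(payload)
        return offset, len(payload)

    def read(self, claim: Claim) -> bytes:
        offset, length = claim
        return bytes(self._log[offset:offset + length])


class AttributeStore:
    """Mutable key-value store: FlowFile id -> content claim plus attributes."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def create(self, claim: Claim, **attributes: str) -> str:
        flow_file_id = str(uuid.uuid4())
        self._records[flow_file_id] = {"claim": claim, "attributes": dict(attributes)}
        return flow_file_id

    def update(self, flow_file_id: str, **attributes: str) -> None:
        # Attributes change freely; the content claim is left untouched.
        self._records[flow_file_id]["attributes"].update(attributes)

    def get(self, flow_file_id: str) -> dict:
        return self._records[flow_file_id]


content = ContentLog()
store = AttributeStore()

first = store.create(content.append(b"first payload"), filename="a.txt")
second = store.create(content.append(b"second"), filename="b.txt")

# Adding an attribute rewrites only the small metadata record, never the payload.
store.update(first, route="archive")
print(store.get(first)["attributes"], content.read(store.get(first)["claim"]))
```

In this model a step that only tags or routes a FlowFile touches the key-value record alone, so the cost of the step does not grow with the size of the payload.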
With some thought, a performant flow can be architected even for very large payloads by minimizing copying, and complex flows can be accommodated too.
Less Data Friction with NiFi
One of the beauties of NiFi is that it works to lessen the friction of data flowing. NiFi uses a really nice abstraction, the FlowFile, to split the problem into two optimized solutions for content and metadata. Plus, the flow based programming model it delivers lets users inject domain knowledge to lessen friction even further by tailoring a flow to the problem, all delivered in a rich UI. So, if you need to move some data, go with the flow and check out NiFi.
In the future, we'll look at extending NiFi with custom Processors and we'll build an in depth example with more than a few steps. In the meantime, we'll leave you with this introductory screencast which walks through installing NiFi and creating an example data flow: