Discover Space on Twitter using Elasticsearch's Percolate Query

Elasticsearch's percolate query is a great feature for alerting, monitoring, and categorizing documents in real-time. In this article, we'll explore how to use it by managing streams of tweets and gathering messages about space exploration.

If your application needs to provide users with real-time updates for data as it enters your system, you might want to consider Elasticsearch's percolator query. With the percolator query, you turn searching around: you evaluate incoming documents against stored queries rather than querying documents already stored in an index. This is really useful for cases that involve real-time notifications and alerting, document classification, or aggregating streaming data to identify trends.

In this example, we'll show you how the percolator query works by monitoring what people are talking about on Twitter using the #NASA hashtag. To do that, we'll create some queries about planets and space terminology. Then we'll percolate the tweets that fit our queries and return the user and the tweet.

In the next part, we'll show you how to add metadata fields to the queries to make the percolator query even more powerful, and how to push the results into other Compose databases for further analysis.

Before that, though, let's get a little more familiar with the percolator query...

What's a percolator query?

As we mentioned above, Elasticsearch's percolator query can be summed up as searching in reverse: instead of querying documents stored in an index, you store queries in an index and then percolate documents against them to figure out which queries each document matches. Essentially, you're determining whether documents match the query criteria, and if they do, the response you get back from the percolator is an array of the queries that matched.

The most common use cases for the percolator query involve monitoring new documents of interest in real time. This might include sending alerts to customers about price changes or weather, or spotting patterns in documents, where keywords are used to tag documents fitting certain criteria so they can later be categorized and indexed.

The last use case is similar to what we're going to do with tweets. In our example, we're going to classify tweets based on whether they contain space terminology (astronaut, NASA, galaxy, interstellar, etc.) or planet names. Then, for each tweet Elasticsearch percolates, we'll return the Twitter handle along with the tweet itself.

Setting up queries

To set up the queries that we'll use for the percolator query, you'll first need an Elasticsearch deployment with a user set up. You'll need to create an index to store your queries, which can be done quickly via the Compose Elasticsearch Browser window, accessed with the Browser button in the left-hand menu. From the Browser window, click Create Index in the top right. That'll open a window where you enter the name for the index and click Run; ours is called "space_exploration".
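
If you'd rather create the index from the console instead of the UI, it's a single request in Kibana (or the equivalent curl call against your deployment's URI):

PUT space_exploration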

Once that's done, we'll need to figure out what kinds of queries we want to store to capture the appropriate documents. Queries are essentially documents: they are made up of fields like a normal document, and they also require a mapping type so that they can be stored in an index. However, unlike a normal document, which requires only a single mapping, queries set up for percolation require at least two: one defining the fields the queries will search against, and another containing a field with the "percolator" type to hold the stored queries.

We're using Kibana to add the mappings and queries; alternatively, you can use curl against your Elasticsearch deployment.

PUT space_exploration/space_tweets/_mapping  
{
  "properties": {
    "tweet": {
      "type": "text"
    }
  }
}

The next mapping type will contain the percolator field and type:

PUT space_exploration/space_queries/_mapping  
{
  "properties": {
    "query": {
      "type": "percolator"
    }
  }
}
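
If you're using curl rather than Kibana, the same mapping requests can be sent directly to your deployment; for example, here's the first one, reusing the placeholder URI from your Overview page (yours will have your own host, port, and credentials):

curl -XPUT 'https://user:pass@portal21-20.superior-elastic-search-2.compose-2.composedb.com:99999/space_exploration/space_tweets/_mapping' -H 'Content-Type: application/json' -d '{"properties": {"tweet": {"type": "text"}}}'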

Either way, both mappings will give you back an acknowledgment from Elasticsearch that their creation was successful:

{
  "acknowledged": true
}

Adding queries

Now that the mappings are added, let's add the queries.

We'll make them very simple for this example. Since the queries are documents, they will also need to include the appropriate document metadata when inserting them into the index. Here's the list of documents, in bulk format, that we'll add to the index using keywords like astronauts, Mars, Jupiter, astrophysics, interstellar, and so on:

{"index":{"_index":"space_exploration","_type":"space_queries","_id":1}}
{"query":{"match":{"tweet":"astronauts"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":2}}
{"query":{"match":{"tweet":"mars"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":3}} 
{"query":{"match":{"tweet":"jupiter"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":4}} 
{"query":{"match":{"tweet":"astrophysics"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":5}}
{"query":{"match":{"tweet":"interstellar"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":6}} 
{"query":{"match":{"tweet":"galaxy"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":7}} 
{"query":{"match":{"tweet":"nebula"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":8}} 
{"query":{"match":{"tweet":"space"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":9}} 
{"query":{"match":{"tweet":"moon"}}}
{"index":{"_index":"space_exploration","_type":"space_queries","_id":10}} 
{"query":{"match":{"tweet":"pluto"}}}

Save these in a file, with a .json or .txt extension for instance. Ours is called queries.json.

Next, we need to add them to the index. We'll use curl to bulk insert them through the bulk API, which makes it easy to add many documents at once. Take the URI that's available on your deployment's Overview page, append _bulk to it, and then set up a POST request like the following, attaching the file that contains the queries:

curl -XPOST 'https://user:pass@portal21-20.superior-elastic-search-2.compose-2.composedb.com:99999/_bulk' --data-binary @queries.json  

Once you execute that, it'll return a JSON document acknowledging that your data has been inserted:

{"took":650,"errors":false,"items":[{"index":{"_index":"space_exploration","_type":"space_queries","_id":"1","_version":1,"result":"created","_shards":{"total":3,"successful":3,"failed":0},"created":true,"status":201}}
...
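
If you want to confirm the queries were stored, a quick match_all search against the type should return all ten:

GET space_exploration/space_queries/_search
{
  "query": {
    "match_all": {}
  }
}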

Now, let's check what we get back if we give Elasticsearch an example tweet. First, we set up the percolate query to match the tweet against the stored queries, then run it using curl or Kibana.

POST space_exploration/space_queries/_search  
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "space_tweets",
      "document": {
        "tweet": "nasa sends astronauts into space"
      }
    }
  }
}

The query starts out using the "query" field like any other search and uses "percolate" to percolate the tweet. The "field" and "document_type" parameters are required: "field" names the field that holds the indexed queries ("query"), while "document_type" is the mapping type of the document being percolated ("space_tweets"). The "document" parameter carries the source being percolated, here the "tweet" text. The percolate query documentation covers additional parameters for percolating documents that are already stored in an index.
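
As a sketch of that last option, the percolate query also accepts "index", "type", and "id" parameters that point at an already-indexed document; in this hypothetical example, "tweets_archive" is an index that already holds tweets:

POST space_exploration/space_queries/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "space_tweets",
      "index": "tweets_archive",
      "type": "space_tweets",
      "id": "42"
    }
  }
}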

The result of that query gives us the following, which tells us that the tweet matched two queries (space, astronauts).

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2824934,
    "hits": [
      {
        "_index": "space_exploration",
        "_type": "space_queries",
        "_id": "10",
        "_score": 0.2824934,
        "_source": {
          "query": {
            "match": {
              "tweet": "space"
            }
          }
        }
      },
      {
        "_index": "space_exploration",
        "_type": "space_queries",
        "_id": "1",
        "_score": 0.2824934,
        "_source": {
          "query": {
            "match": {
              "tweet": "astronauts"
            }
          }
        }
      }
    ]
  }
}

But that doesn't give us a lot to work with. We can add a "highlight" field to our query that will give us back the tweet with the words highlighted using the <em> HTML tag. Run this:

POST space_exploration/space_queries/_search  
{
  "query": {
   ...
  },
  "highlight": {
    "fields": {
      "tweet": {}
    }
  }
}

Then you'll get this added to the response:

{
...
        "highlight": {
          "tweet": [
            "nasa sends astronauts into <em>space</em>"
          ]
...
        "highlight": {
          "tweet": [
            "nasa sends <em>astronauts</em> into space"
          ]
...
}

Great! Now that the percolator query is evaluating our text, let's see how we use it on a live stream of #NASA tweets from Twitter to get those that match our queries.

Setting up the application

You need to have access to a running installation of Node-RED. If you already have one up, read on; otherwise, dig into Power Prototyping with MongoDB and Node-RED, which gives you all the tools to get it running and understand how to set up nodes.

With Node-RED running, find the Twitter node in the palette and place the input node on your canvas.

Open it by double-clicking the node, then click the pencil button next to the Twitter ID field to authenticate the node with Twitter. After authenticating, you'll be back in the editor; enter the #NASA hashtag in the for field, then click Done. Each tweet will then be passed along in msg.payload.

To limit the number of tweets we get per minute, drag a delay node onto the canvas. Again, double-click the node to open the editor. Here, we'll set the Action to "Rate Limit" and the Rate to 30 messages per minute.

Then click Done and connect the "Twitter" node to the "delay" node.

Since we should now have a stream of tweets coming in, we'll create a function by dragging the function node onto the canvas. Double-click the node, add our percolator query from above as the function body, and click Done:

// Wrap the incoming tweet text in a percolate query for Elasticsearch.
msg.payload = {
    "query": {
        "percolate": {
            "field": "query",
            "document_type": "space_tweets",
            "document": {
                // msg.payload arrives as the tweet text from the Twitter node
                "tweet": msg.payload
            }
        }
    },
    // Ask Elasticsearch to wrap matched terms in <em> tags
    "highlight": {
        "fields": {
            "tweet": {}
        }
    }
};
return msg;

Connect that function to the delay node. Next, we'll make an HTTP POST request to send the query along with the tweet. Find the http request node, drag it onto the canvas, and edit it. In the Method field select POST, and in the URL field put your Compose Elasticsearch connection URI with "space_exploration/space_queries/_search" appended to it, the same endpoint we queried in Kibana. Also set Return to "a parsed JSON object" so the next node can work with the response directly. Click Done, then attach the node to the previous function node.
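
Using the placeholder connection string from earlier, the URL field would look something like this (your host, port, and credentials will differ):

https://user:pass@portal21-20.superior-elastic-search-2.compose-2.composedb.com:99999/space_exploration/space_queries/_search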

Let's check our output by attaching a debug node and looking at what Elasticsearch gives us. After attaching the node, click Deploy, then open the debug panel. You should see a number of tweets in the debug console, where we can view each tweet and its matching queries. But this is not quite enough.

If you look at the tweets we're getting back, you'll find some that don't match any of our queries. It's good that we're getting data back, but we don't want those unmatched tweets to be returned. To filter them out, we can create another function by dragging another function node onto the canvas. This time, we'll tell it to only pass on responses that matched queries in Elasticsearch. Place the following code in that node and click Done when you're finished:

// Only pass the message on if the tweet matched at least one stored query.
if (msg.payload.hits.total > 0) {
    var message = [];
    // Collect the highlighted text from each matching query hit.
    for (var i = 0; i < msg.payload.hits.hits.length; i++) {
        message.push(msg.payload.hits.hits[i].highlight.tweet[0]);
    }
    msg.payload = {
        // The Twitter node attaches the original tweet object as msg.tweet
        username: msg.tweet.user.screen_name,
        tweets: message
    };
    node.send(msg);
}

This code filters the hits we get back from Elasticsearch. If the hits array is empty, the tweet is discarded. If there are hits, we loop through each one, take the highlighted tweet text, and push it into the message array. From there, we create a new msg.payload object containing that array of tweets along with the username of the person who tweeted, which comes from the screen_name field of the msg.tweet object that the Twitter node attaches. Finally, we send the entire msg object along with node.send().

Connect this node to the http request node and attach the debug node to that. Then, click Deploy again. When you open up the debug panel this time, you should see the username and the tweets with the <em> HTML tag highlighting the matched words.
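
For a matching tweet, the payload printed in the debug panel will be shaped something like this (the username and text here are made up for illustration):

{
  "username": "space_fan_42",
  "tweets": [
    "nasa sends <em>astronauts</em> into space"
  ]
}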

A Next Small Step ...

Now that we have the tweets coming in and being labelled, we'll stop there. Get creative and add more queries, or fine-tune the results in the code. In the next installment, we'll pick up where we left off, look at storing the matched tweets back into Elasticsearch, and show you how to push them into other Compose databases for some statistical analyses.

