To Go from RSS to Elasticsearch

As Compose's technical content curator, I get a lot of questions – questions like "Could you make us some data for Elasticsearch quickly?". Usually I explain I'm not the curator of that sort of content. For a change though, in this article I'm going to answer that question with a simple application which converts an RSS feed into an Elasticsearch index. Along the way we'll look at parsing XML and generating JSON in Go, use the Elastigo client library and pick up a few other handy hints. If you don't normally work in Go, this should also give you a taste of the language and its ecosystem.

We'll start from the core of the code rather than going through it line by line; you can find the full code in our example repository. We've called the example RuSShes...

Parsing the RSS feed

Let's start by parsing the RSS feed which is, of course, an XML formatted file. We're not going to try and handle all the flavours of RSS feed here, just the common RSS 2.0 variant found on blogs like Ghost and WordPress. The actual file has a preamble of metadata about the feed – title, version, link, description – and then a set of item elements which contain the stories. Each of those items contains a title, a link, a description, content, a publication date and other bits of data, along with a globally unique ID which identifies the item. If you have a look at https://www.compose.com/articles/rss/ you'll see this.
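
Trimmed down to just the elements we care about, and with hypothetical values, the shape looks something like this:

    <rss version="2.0">
        <channel>
            <title>Compose Articles</title>
            <link>https://www.compose.com/articles/</link>
            <description>Articles from Compose</description>
            <pubDate>Mon, 01 Feb 2016 12:00:00 GMT</pubDate>
            <item>
                <title>A hypothetical story</title>
                <link>https://www.compose.com/articles/a-story/</link>
                <description>A summary of the story</description>
                <pubDate>Mon, 01 Feb 2016 12:00:00 GMT</pubDate>
                <guid>https://www.compose.com/articles/a-story/</guid>
            </item>
        </channel>
    </rss>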

So how do you get Go to process XML? Go has the ability to attach metadata, called a tag, to the fields of a struct to say how they should be handled by particular libraries. A tag is entered as a back-quoted string after the field's declaration. Go's XML library uses this tag to work out which XML element names should be mapped to which fields when parsing happens. So for the preamble elements in the RSS feed, we can define a Go struct which has fields for the information we may be interested in and tags for the XML library:

// Rss2 defines the XML preamble describing the feed
type Rss2 struct {  
    XMLName     xml.Name `xml:"rss"`
    Version     string   `xml:"version,attr"`
    Title       string   `xml:"channel>title"`
    Link        string   `xml:"channel>link"`
    Description string   `xml:"channel>description"`
    PubDate     string   `xml:"channel>pubDate"`
    ItemList    []Item   `xml:"channel>item"`
}

The part of each tag which begins xml: contains the XML parsing information. So, for example, the XMLName variable is populated from the <rss> element which opens the document and the Title variable comes from the <title> element within the <channel> element. For our purposes, most of these are fields we'll ignore for now. The important bit is at the end...

    ItemList    []Item   `xml:"channel>item"`

This says that there are <item> elements within the <channel> element and that they should be gathered up into this slice. Each of those items is, in turn, defined by another struct... something like this:

// Item describes the items within the feed
type Item struct {  
    Title       string        `xml:"title"`
    Link        string        `xml:"link"`
    Description template.HTML `xml:"description"`
    Content     template.HTML `xml:"encoded"`
    PubDate     string        `xml:"pubDate"`
    Comments    string        `xml:"comments"`
    GUID        string        `xml:"guid"`
    Category    []string      `xml:"category"`
    Creator     string        `xml:"creator"`
}

This is even simpler, in terms of XML mapping, than the preamble as there's only one level of elements we're interested in. With these two structures defined, we can read the content from an RSS feed and parse it into usable data... Here's the snippet that does that...

    r := Rss2{}

First we create a variable of the Rss2 struct type we defined earlier. Then, given that *feedurl holds the URL of our RSS feed, we use Go's default HTTP client to get a response from that URL...

    response, err := http.DefaultClient.Get(*feedurl)

We can then decant the body of that response into a byte slice...

    xmlContent, err := ioutil.ReadAll(response.Body)

And finally we feed that byte slice into the XML unmarshaller, along with a pointer to the Rss2 variable, which will be populated according to all those hints we put into its and Item's definitions:

    err = xml.Unmarshal(xmlContent, &r)

Add in some error checking and we're parsing an RSS feed into Go data.
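
Put together, with that error checking added, the whole fetch-and-parse step looks something like this minimal sketch – it assumes the net/http, io/ioutil, encoding/xml and log packages are imported, and simply bails out with log.Fatal on any error:

    r := Rss2{}

    // Fetch the feed over HTTP
    response, err := http.DefaultClient.Get(*feedurl)
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()

    // Read the whole response body into a byte slice
    xmlContent, err := ioutil.ReadAll(response.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Unmarshal the XML into our tagged Rss2 struct
    err = xml.Unmarshal(xmlContent, &r)
    if err != nil {
        log.Fatal(err)
    }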

Talking to Elasticsearch

There are a number of routes to Elasticsearch from Go. The most basic would be to roll your own HTTP/REST calls, but that gets tedious quickly. So we're going to use one of the Elasticsearch client libraries, Elastigo. We import it like so:

import (  
    elastigo "github.com/mattbaird/elastigo/lib"
)

Assuming we have a URL for the Elasticsearch server which includes any username and password, we can simply create a new connection and set that URL as the destination. We'll assume that *esurl points to our URL:

    client := elastigo.NewConn()
    client.SetFromUrl(*esurl)

Now, inserting a JSON document is as simple as...

    client.Index(*esindex, *estype, id, nil, jsonValue)

Where *esindex and *estype point to the Elasticsearch index and type names, id is a value to identify the document we're adding and jsonValue is the JSON document we want to insert. Nothing too odd, but where do we get our JSON document from?
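
It's worth noting that Index also returns a response and an error; we'll ignore both in the snippets that follow to keep them short, but a more careful call would look something like this sketch:

    // Index the document, halting with a logged message if it fails
    _, err := client.Index(*esindex, *estype, id, nil, jsonValue)
    if err != nil {
        log.Fatalf("Indexing document %s failed: %v", id, err)
    }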

Creating the JSON document

Remember how we used tags to define how the XML would be parsed? Well, we can do exactly the same for JSON data. In this case, the tag defines how the keys in the generated JSON document are named. And just as the XML part of a tag is preceded with xml:, the JSON part is preceded with json:. For our purposes, all we need to do is lower-case the key names. Here's the Item structure with those tags added:

// Item describes the items within the feed
type Item struct {  
    Title       string        `xml:"title" json:"title"`
    Link        string        `xml:"link" json:"link"`
    Description template.HTML `xml:"description" json:"description"`
    Content     template.HTML `xml:"encoded" json:"content"`
    PubDate     string        `xml:"pubDate" json:"pubdate"`
    Comments    string        `xml:"comments" json:"comments"`
    GUID        string        `xml:"guid" json:"guid"`
    Category    []string      `xml:"category" json:"category"`
    Creator     string        `xml:"creator" json:"creator"`
}

Now, we are ready to turn an Item into JSON. All we need to do is call json.Marshal() and pass it an Item:

    jsonValue, _ := json.Marshal(item)

Or, if you want a nicely indented JSON string – handy when debugging or demonstrating:

    jsonValue, _ := json.MarshalIndent(item, "", "    ")
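
For a hypothetical feed item – values invented for illustration – that produces something like this; note how the json: parts of the tags have lower-cased the keys:

    {
        "title": "A hypothetical story",
        "link": "https://www.compose.com/articles/a-story/",
        "description": "A summary of the story",
        "content": "The full story content",
        "pubdate": "Mon, 01 Feb 2016 12:00:00 GMT",
        "comments": "https://www.compose.com/articles/a-story/#comments",
        "guid": "https://www.compose.com/articles/a-story/",
        "category": [
            "go",
            "elasticsearch"
        ],
        "creator": "A. Writer"
    }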

We'll just assemble that with the Elasticsearch index command from earlier and we get...

    for _, item := range r.ItemList {
        jsonValue, _ := json.MarshalIndent(item, "", "    ")
        if *esurl != "" {
            client.Index(*esindex, *estype, item.GUID, nil, jsonValue)
        } else {
            fmt.Println(string(jsonValue))
        }
    }

That's using r, the output from our XML parsing code. There's one little change we've slipped in there: we're using the GUID of the RSS feed item as the id field for the document. If the feed behaves correctly, this should make it much easier to track changes later – but that's for another article. We're almost done, apart from one question: where did feedurl, esurl, esindex and estype come from?

Parsing the arguments

There are a few ways to parse command line arguments in Go. I've gone with Kingpin for this example because of its fluent API. Here's its code, extracted from the example:

import (  
    "gopkg.in/alecthomas/kingpin.v2"
)

var (  
    feedurl = kingpin.Arg("feed", "RSS Feed URL").Required().String()
    esurl   = kingpin.Arg("es", "Elasticsearch URL").String()
    esindex = kingpin.Arg("index", "ES index name").Default("rss").String()
    estype  = kingpin.Arg("type", "ES type name").Default("rssitem").String()
)

That var block is all the configuration you need to do. The example requires a URL for an RSS feed as a string; that's expressed with .Required().String(). The other arguments are optional – if no Elasticsearch URL is given, the JSON results are printed to stdout instead. The index and type names get fallback values through .Default(). Kingpin can handle IP and URL arguments and much, much more, but we're sticking with strings here to keep things simple. If we run our example code with a --help parameter, Kingpin produces this:

usage: RuSShes [<flags>] <feed> [<es>] [<index>] [<type>]

Flags:  
  --help  Show context-sensitive help (also try --help-long and --help-man).

Args:  
  <feed>     RSS Feed URL
  [<es>]     Elasticsearch URL
  [<index>]  ES index name
  [<type>]   ES type name
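
One detail the var block doesn't show is that those variables only get their values once kingpin.Parse() has been called, so the program's main function starts with something like this sketch:

    func main() {
        // Parse the command line; Kingpin prints usage and exits on error
        kingpin.Parse()

        // ...fetch the feed, parse it and index the items as shown above...
    }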

Ready to run

So, now with everything in place, you can see, and build, the complete code from the compose-ex repository. Once built, you run it with an RSS feed URL and an optional URL for your Elasticsearch server and...

$ ./RuSShes https://compose.com/articles/rss https://user:pass@aws-us-east-1-portal3.dblayer.com:10653/ 
$

It will run and populate, by default, the index rss with rssitem type documents. You can check this by browsing your Elasticsearch instance with a plugin like ElasticHQ. That's it for this article. In the future, we'll be looking at how to efficiently update these indexes and how to search them effectively – and at what we've left out here that would be an obstacle to that. Till then, have fun parsing and importing.