MongoDB & Full Text Search: My First Week With MongoDB 2.4 Development Release


At MongoHQ, we deployed and made available for use a hosted MongoDB 2.3 beta server to allow ourselves and our customers an opportunity to check out some of the new features that 10gen has been working on. Admittedly, a nice benefit of working at MongoHQ is we get to toy around with a lot of new and interesting technology.

From my initial reading of the 2.3 release notes, I underestimated full text searching. Before getting into that, I want to mention one gotcha.

So, the gotcha …

CopyDB and Auth - Using the hosted MongoDB 2.3 server, I wanted to use some of MongoHQ’s tools to transfer data from an existing database to my new MongoDB 2.3 database. To do this, I used the built-in copyDatabase command, which MongoHQ exposes via their web-based UI. When I did that, I saw the following:

Tue Jan 22 14:58:44.314 [conn697] copydbgetnonce is not supported when running with authentication enabled  
Tue Jan 22 14:58:44.315 [conn697] copydb is not supported when running with authentication enabled  

I temporarily got around this using the mongodump and mongorestore commands, but it looks like there are still some auth-related bugs to be cleaned up in the developmental release. No worries, as finding these is exactly what developmental releases are for.

Now, on to some details around full-text search…

Underestimating Full Text Searching

Full-text searching with MongoDB 2.4 is more complex and powerful than originally illustrated in our first blog post outlining this feature. In the original example, we showed:

db.emails.ensureIndex({body: "text"}, {name: "email_body_text_index"})  
db.emails.runCommand( "text", {search: "Pho"} )  

However, a number of other settings are available for these query types. I’ll quickly break down each one, as each is pretty useful.

Field Weighting

The documented spec only allows one text index per collection, but it allows multiple fields on that single index. My first thought goes to fields like tags, subjects, and full text. For example:

db.emails.ensureIndex({tags: "text", subject: "text", body: "text"}, {  
    name: "email_text_index",
    weights: {
      tags: 5,    // Assume folks are better at tagging than writing
      subject: 4, // Assume the subject is better than the body at description
      body: 1
    }
})

MongoDB uses a scoring algorithm to choose the most appropriate documents based on the weights and the number of matches.
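
To make the idea of weighting concrete, here is a toy sketch in plain JavaScript. It is not MongoDB's actual scoring algorithm; it simply assumes that every occurrence of a term in a field adds that field's weight to the score. The function names `countMatches` and `toyScore` and the two sample documents are made up for illustration.

```javascript
// Toy illustration only: NOT MongoDB's real scoring algorithm.
// Assumption: each occurrence of the term in a field adds that field's weight.
var weights = { tags: 5, subject: 4, body: 1 };

function countMatches(text, term) {
  // Split on non-word characters and count case-insensitive exact matches.
  var words = String(text || "").toLowerCase().split(/\W+/);
  var needle = term.toLowerCase();
  return words.filter(function (w) { return w === needle; }).length;
}

function toyScore(doc, term) {
  // Sum weight * match count over every weighted field.
  return Object.keys(weights).reduce(function (score, field) {
    return score + weights[field] * countMatches(doc[field], term);
  }, 0);
}

// Hypothetical documents: one tagged with "pho", one that only mentions it.
var tagged = { tags: "pho lunch", subject: "Friday plans", body: "Where should we eat?" };
var mentioned = { tags: "lunch", subject: "Friday plans", body: "Maybe pho, maybe not." };

var taggedScore = toyScore(tagged, "Pho");       // one match in tags
var mentionedScore = toyScore(mentioned, "Pho"); // one match in body
```

Under these assumptions the tagged document scores 5 and the document that only mentions the term in its body scores 1, so the tagged one would rank first.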

Negative Terms (not yet documented) and Phrases

If you look at the code, you will see that negative words are also possible. So, you can run something like the following:

db.emails.runCommand("text", {search: "Pho -\"gangnam chicken\""})  

Running the above returns documents that mention “Pho” but do not contain the phrase ‘gangnam chicken’.
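
One way to picture how such a search string breaks down is the toy parser below, written in plain JavaScript. It is not the tokenizer MongoDB uses; it assumes that phrases are wrapped in double quotes and that a leading "-" negates whatever follows. The function name `parseSearch` and the shape of its result are made up for illustration.

```javascript
// Toy illustration only: NOT the tokenizer MongoDB uses.
// Assumptions: phrases are wrapped in double quotes, and a leading "-"
// negates the term or phrase that follows it.
function parseSearch(search) {
  var parsed = { terms: [], negatedTerms: [], phrases: [], negatedPhrases: [] };
  // Match an optionally negated quoted phrase, or an optionally negated bare word.
  var pattern = /(-?)"([^"]+)"|(-?)([^\s"]+)/g;
  var match;
  while ((match = pattern.exec(search)) !== null) {
    if (match[2] !== undefined) {
      // Quoted phrase: match[1] is "-" when negated, "" otherwise.
      (match[1] ? parsed.negatedPhrases : parsed.phrases).push(match[2]);
    } else {
      // Bare word: match[3] is "-" when negated, "" otherwise.
      (match[3] ? parsed.negatedTerms : parsed.terms).push(match[4]);
    }
  }
  return parsed;
}

var parsed = parseSearch("Pho -\"gangnam chicken\"");
```

Here `parsed.terms` holds "Pho" and `parsed.negatedPhrases` holds "gangnam chicken"; the other two lists stay empty.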

Entire Document Indexing (i.e. wildcard)

Scanning through the documentation, I ran across this nugget: “$**”. With it, you can supply the entire document to full text searching like so:

db.emails.ensureIndex({"$**": "text"}, {name: "email_index_text"})  

If you are following along at home, you will need to drop the index above before adding this one, since only one text index is allowed per collection.
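
A minimal sketch of that swap is below, written as a function so it can be pasted into the mongo shell and called with a collection such as db.emails. The function name `switchToWildcardTextIndex` is made up, and it assumes the earlier index was created with the name used above, "email_text_index".

```javascript
// Sketch for the mongo shell; `collection` would be something like db.emails.
// Assumption: the existing text index was created with the name used earlier
// in this post ("email_text_index").
function switchToWildcardTextIndex(collection) {
  // Only one text index is allowed per collection, so drop the old one first.
  collection.dropIndex("email_text_index");
  // "$**" tells MongoDB to index every field that contains string content.
  collection.ensureIndex({ "$**": "text" }, { name: "email_index_text" });
}
```

In the shell this would be invoked as `switchToWildcardTextIndex(db.emails)`. Dropping and rebuilding a text index on a large collection takes time, so try it on test data first.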

I am working with very large documents on our test database, and as expected the wildcard indexes were noticeably slower than specifying precise fields. This certainly makes sense, but is something to keep in mind as your data set grows.

Index Sizes

After this first week, I began thinking, “How big are these indexes? How much RAM will they require?” In our internal tests, we are not working with huge data sets on this project yet, but we were seeing wildcard indexes consuming about 1/3 of the total data size for the collection.

Given this, you should take this type of memory use into consideration as some long-running text searches could inadvertently shove a good chunk of your active data set out of memory. 10gen makes mention of this in their release notes, so make sure you keep this in mind as you plan out resources required to use this feature.
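
If you want to check this ratio on your own data, a rough sketch follows. It relies on the `size` and `indexSizes` fields reported by the collection's stats() output in the mongo shell; the function name `indexToDataRatio` and the sample numbers are made up for illustration and are not measurements from our test database.

```javascript
// Sketch: compare one index's size to the collection's data size.
// `stats` is the document returned by db.<collection>.stats() in the mongo shell,
// which reports `size` (data bytes) and `indexSizes` (bytes per index name).
function indexToDataRatio(stats, indexName) {
  var indexBytes = stats.indexSizes[indexName];
  if (indexBytes === undefined) {
    throw new Error("No index named " + indexName);
  }
  return indexBytes / stats.size;
}

// Hypothetical numbers, not measurements from a real database.
var sampleStats = {
  size: 3000000,
  indexSizes: { _id_: 100000, email_index_text: 1000000 }
};
var ratio = indexToDataRatio(sampleStats, "email_index_text");
```

In the shell you would pass the real thing, for example `indexToDataRatio(db.emails.stats(), "email_index_text")`. With the made-up numbers above, the ratio comes out to one third.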

The Code

All of MongoDB is open-source. All of the full text search is modularized inside the fts directory at:

The significant files are:
- fts_matcher.cpp contains the matching algorithm – there are some undocumented nuggets in there.
- fts_search.cpp contains the search runner and result builder.

Even if you are just a Ruby or Node or PHP or Python developer and prefer to stay away from C++, reading code is always a good exercise. As you learn more about this feature, take a moment to learn more about the code that operates it as well.

We are excited about some of the capabilities that full-text search adds to MongoDB and looking forward to seeing this feature mature over the coming months.

Get Started

To get started testing the MongoDB 2.3 release, sign up for MongoHQ, and create a database on the “Experimental” plan.

Chris Winslett
Chris Winslett is a complex arrangement of Oxygen, Carbon and Hydrogen. Includes some salt. And beer.
