Counting & Not Counting with MongoDB

TL;DR: When you only need to know whether a count has reached a threshold, you can avoid a full count for much better performance.

Counting things can be hard on a database. The obvious things to count, such as the number of items in a collection or index, are easy to keep track of, but as soon as you start counting records that match a query, things get slower and slower. That's because the database has to scan practically every item in the collection to see whether it matches; indexes can cut that scanning down, but not by as much as you might think.

The road to performance can often be found by stepping back and looking at the wider problem. For example, one customer found that their count operations were slow: tens of millions of documents were being searched in each count query. But when we stepped back from that specific issue, it turned out that the actual value of the count wasn't important to them. What they wanted was to trigger a particular process when the number of documents matching their query exceeded a threshold.

Rather than talk in the abstract, let's get some data to experiment with. Don't have ten million records to exercise your database? This snippet of code, run in the Mongo shell, will build you a collection to fix that. It creates ten million records, each with three fields, a, b and c, set to random integer values between 0 and 99:

// Return a random integer between 0 and n-1
function randInt(n) { return Math.floor(Math.random() * n); }

// Build ten million documents in ten bulk operations of a million inserts each
for (var j = 0; j < 10; j++) {
  print("Building op " + j);
  var bulkop = db.countTest.initializeOrderedBulkOp();
  for (var i = 0; i < 1000000; ++i) {
    bulkop.insert({ a: randInt(100), b: randInt(100), c: randInt(100) });
  }
  print("Executing op " + j);
  bulkop.execute();
}
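That will take a little while to run. Once it's done, a quick sanity check confirms the collection is fully populated; a bare count() with no query is answered from collection metadata rather than by scanning, so it comes back immediately:

>db.countTest.count()
10000000
>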

Say we want to know how many records are in a particular state, where two fields are in the high range of values but another is in the low range, because this might denote a problem. We could come up with a query like this...

db.countTest.count( { a : { $gt: 50 }, b: { $lt:50 }, c: { $gt:50}} );  

And on our test server that takes around ten seconds to execute. The server has to go through every record to find the 1.2 million(-ish) that match, and it doesn't know it's done until it has counted every one. Say, though, that 10,000 matches is our threshold and anything beyond that is useless information. If only we could limit our search to finding the first 10,000 matches and return after that. There's no way, though, to limit the number of matches with the .count() command.

But there is a way to limit the number of documents returned by a find query, and we can easily translate the .count() method's query into a .find(). We can then use the projection option of find to cut the returned fields down to just the _id. Finally, we can use the .limit() method to restrict the number of returned records to the threshold value and .count() the results...

>db.countTest.find(  { a : { $gt: 50 }, b: { $lt:50 }, c: { $gt:50}},
              { _id:1 }).limit(10000).count()
1224914  
>

And that's not right; it took just as long! What's happened is that count has ignored the limit, scanning and counting all the records again. If we consult the manual, though, we find that this isn't fixed behavior, just the default behavior. If we want count to pay attention to limit (and skip), we need to give it the applySkipLimit option set to true. So let's add that option in:

>db.countTest.find(  { a : { $gt: 50 }, b: { $lt:50 }, c: { $gt:50}},
              { _id:1 }).limit(10000).count( { applySkipLimit: true } )
10000  
>

Wall-clock time on this query was 68ms. The explanation for this boost is that, in all, the server only needed to scan around 80,000 records before it found 10,000 that matched the query. We know it found 10,000 because it returned 10,000, which for our purposes means there are at least 10,000 matches in the database.
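If you want to see that scanning for yourself, the query planner will report it. On MongoDB 3.0 and later, a sketch of the check looks like this: swap the .count() for .explain("executionStats") on the same cursor and inspect the result...

>db.countTest.find(  { a : { $gt: 50 }, b: { $lt:50 }, c: { $gt:50}},
              { _id:1 }).limit(10000).explain("executionStats")

In the output, executionStats.totalDocsExamined should come in around the 80,000 documents mentioned above, and executionStats.executionTimeMillis in the tens of milliseconds.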

Let's tweak the query, though, to understand it better... this time looking for a greater than 75, and b and c less than 25. First, the .count() form:

db.countTest.count( { a : { $gt: 75 }, b: { $lt:25 }, c: { $lt:25}} );  

That takes around ten to twelve seconds to execute too. Now the limited .find() form:

db.countTest.find(  { a : { $gt: 75 }, b: { $lt:25 }, c: { $lt:25}}, { _id:1 }).limit(10000).count( { applySkipLimit: true })  

That takes 800ms to execute; still faster, but now, with a more selective query, fewer records match, so more records need to be scanned before the limit is reached: given the random distribution of the created records, around six to seven hundred thousand of them. The other way to increase the number of scanned records, and with it the execution time, is to raise the threshold. If we bump the threshold/limit up to 100,000, our query takes over six seconds, which is hardly surprising as it also had to scan six and a half million records.
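For reference, that slower variant is just the same query with a bigger limit:

db.countTest.find(  { a : { $gt: 75 }, b: { $lt:25 }, c: { $lt:25}}, { _id:1 }).limit(100000).count( { applySkipLimit: true })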

The timings help tell us when the limit method of counting is worth using: specifically, when there's a low threshold value but a particularly high number of matching records, along with a relatively complex query and a particularly large data set. If you have that situation, you might try swapping out some direct calls to count in favor of a "find().limit().count( { applySkipLimit: true } )" formulation. Remember, of course, to still check that the result reaches your threshold value.
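Wrapped up as a small shell helper, the whole pattern might look like the sketch below; hasAtLeast is a hypothetical name of our own, not a MongoDB built-in:

// Sketch of the pattern: returns true if at least `threshold` documents match `query`
// hasAtLeast is a hypothetical helper, not part of MongoDB
function hasAtLeast(collection, query, threshold) {
  var matched = collection.find(query, { _id: 1 })
                          .limit(threshold)
                          .count({ applySkipLimit: true });
  return matched >= threshold;
}

// For example, trigger a process once 10000 "problem" records exist
if (hasAtLeast(db.countTest, { a: { $gt: 50 }, b: { $lt: 50 }, c: { $gt: 50 } }, 10000)) {
  print("Threshold reached");
}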

By not counting the irrelevant records beyond the threshold, though, you could well save yourself seconds on queries and get a smoother, more responsive MongoDB as a result.