Deeper on 3.2 - Partial Indexes and Validation

There are plenty of new features in MongoDB 3.2 and a lot to take in, especially as with many of them it's not what they do but what they don't do that leads to the best understanding of them. So let's begin the tour with Partial Indexes and Validation.

Partial Indexes

For those of you with large datasets, partial indexes may provide a helping hand. They basically allow you to index only the documents in a collection that match a filter. So, for example, say you had a database of users who were either active or inactive...

{
   _id: ...,
   username: "fred",
   active: 1,
   ...
}

I could index the username field, but if I'm only interested in active users, then only indexing the active users when creating an index would potentially reduce index size, at the cost of the index being usable by fewer queries. Well, now we can do that:

db.people.createIndex(  
                      { username:1 },
                      { partialFilterExpression: { active: { $eq: 1 } } }
)

Of course, there are catches. Your query has to include the filter expression, or a condition covered by it, for the query planner to know it can safely use the partial index. That means that...

db.people.find( { username:"fred", active: { $eq:1} } )  

Partial Indexes can't do magic for you. You'll need to create an index using a partial filter that matches what you do in your queries for the index to be used. But do remember that too many indexes will hurt your write performance, so be sparing with partial indexes too, even if the temptation to give many queries their own partial index is high.
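
If in doubt, tacking .explain() onto the query will show whether the partial index was actually chosen. For instance, a query that drops the active condition...

db.people.find( { username: "fred" } )

...can't use the partial index at all, because the planner has no way of knowing whether there are matching "fred" documents that are inactive and therefore missing from the index.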

They are to be preferred over sparse indexes, which can only skip documents where the indexed field is missing. Partial indexes, as above, can filter on fields that aren't being indexed at all. It's a useful optimization but, as with all optimization, be wary of premature application.
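
As a rough comparison, and a sketch rather than anything from the official docs, a sparse index like...

db.people.createIndex( { email: 1 }, { sparse: true } )

...only indexes documents that have an email field. The same effect can be had with a partial index whose filter is just an $exists test...

db.people.createIndex(
                      { email: 1 },
                      { partialFilterExpression: { email: { $exists: true } } }
)

...with the added option of filtering on other fields, such as active, if you need to.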

Validation

The ability to define validation rules for a collection is also new in 3.2. Validation with MongoDB has traditionally been done up in the application layer. From 3.2 onwards it's possible to let the database do more of the validation work.

With MongoDB 3.2, the validation feature lets a collection have a "validator" expression specified as part of the collection's metadata. The validator expression is a match expression which must evaluate to true for a document to be inserted or updated. It can use most of the query operators, though not geospatial proximity ($near, $nearSphere), full text search ($text) or $where expressions.

The MongoDB documentation offers this as an example:

db.createCollection( "contacts",  
   { validator: { $or:
      [
         { phone: { $type: "string" } },
         { email: { $regex: /@mongodb\.com$/ } },
         { status: { $in: [ "Unknown", "Incomplete" ] } }
      ]
   }} )

This is a good example of a very lax validator. It tests whether the phone field is a string, whether the email matches a regular expression and whether the status is one of two string values, by way of the $in operator.

It isn't a very strict validator though. The validator expression has to return true for a document to pass as valid, and here the $or binding the three tests together returns true if any one of them is true. That means that...

db.contacts.insert( { phone: "118118118", status:"Not A Status" } )  

... would pass as valid, because the phone field is a string, despite the status field definitely not being a valid status.

Why not change that $or to a $and, you may wonder. Well, you can, but then every field in the rules has to be present and pass its test, which means that even though the fields specified in, say...

db.contacts.insert( { phone: "118118118", status:"Incomplete" } )  

...are valid, the lack of an email field means it fails.
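
One way to tighten things up, while still keeping some fields optional, is to require the fields you always want and only constrain the others when they are present. As a sketch, and not something from the MongoDB documentation, the contacts validator could be recast like this:

db.createCollection( "contacts",
   { validator: { $and:
      [
         { status: { $in: [ "Unknown", "Incomplete" ] } },
         { $or: [ { phone: { $exists: false } }, { phone: { $type: "string" } } ] },
         { $or: [ { email: { $exists: false } }, { email: { $regex: /@mongodb\.com$/ } } ] }
      ]
   }} )

With that in place, the insert above with the "Incomplete" status would pass, while the earlier one with "Not A Status" would be rejected.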

Validation effects

When a document is passed as valid, things continue as usual. If the document fails validation, the change may be rejected with an error 121, a DocumentValidationFailure. We say "may" because there's another setting, validationAction, which determines that. By default this is set to error and will generate that 121 error, but it can also be set to warn, which lets the operation succeed but notes the validation failure in the log.
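
The validationAction can also be set when a collection is first created rather than afterwards. As a minimal sketch, with a made-up audits collection and validator:

db.createCollection( "audits",
   { validator: { event: { $exists: true } },
     validationAction: "warn" } )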

There's also validationLevel, which controls how strictly validation is applied. It defaults to strict, which means all inserts and updates must pass validation, but it can also be set to moderate, which skips validation for updates to existing documents that don't already satisfy the rules. You can use these to your advantage when migrating a collection to use validation. You can add validation to an existing collection with collMod:

db.runCommand( {  
   collMod: "books",
   validator: { $and: [ { isbn: { $exists: true } }, { title: { $exists: true } } ] },
   validationLevel: "moderate",
   validationAction: "warn" } )

Now, there's the lightest of validation in place and the logs will note when validation has failed. Currently on the beta Compose MongoDB+, the logs are available within 24 hours.

There is another setting for validationLevel, off, which, as you'd expect, turns off validation without removing the validator.
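
Turning it off, and later back to moderate or strict, is just another collMod call. For example:

db.runCommand( { collMod: "books", validationLevel: "off" } )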

Validation catches

Validation is a powerful way of maintaining the consistency of your MongoDB data. There are some catches to using it. First, there's the issue of managing what validations are in place. You can retrieve the validator through db.getCollectionInfos() but there's no way, at least currently, to version the validator.
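
To see what's currently in force on, say, the contacts collection, something like this in the shell does the job:

db.getCollectionInfos( { name: "contacts" } )[0].options.validator

The validationLevel and validationAction settings, where they've been set, appear alongside it in the same options document.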

The other big issue is how validation reports rejected documents. As we said, it throws the 121 error and that's it. To work out why the rejection happened, developers will have to write code that mirrors the validator in the application stack. This in turn means they'll have to go to the database to retrieve the validator or make assumptions about what validation is going on.

So, MongoDB validation is familiar in its use of existing query syntax, can be focussed on only the parts of the database where you want validation enforced, shouldn't have a significant impact on performance and is pretty easy to deploy. Ideally, test your validations and prepare your legacy data clean-up in a staging environment, because it's likely there will be some big bulk operations taking place with any reasonably sized dataset.