Sizing & Trimming Your MongoDB

TL;DR: Your database will happily fill whatever space it is given, so here are commands and techniques for measuring and taming that space consumption.

How big is your MongoDB database? The number and size of the records it holds is what a lot of people consider the main measure of database size, but that's only part of the picture. If the database sits on your own disk drive you may not be bothered by it growing overly large, but in the cloud, where you are billed on the resources consumed, we're talking about real money and how to save it.

Being practical, let's look into one of the example databases here and see what we can do with it. Our first stop in measuring the database is the db.stats() command, which offers a selection of statistics. If we log in to the database with the Mongo shell and run it...

exemplum/19:11:33>db.stats()
{
  "db" : "exemplum",
  "collections" : 10,
  "objects" : 110857,
  "avgObjSize" : 239.9243349540399,
  "dataSize" : 26597292,
  "storageSize" : 43438080,
  "numExtents" : 25,
  "indexes" : 10,
  "indexSize" : 4210640,
  "fileSize" : 2666528768,
  "nsSizeMB" : 16,
  "dataFileVersion" : {
    "major" : 4,
    "minor" : 5
  },
  "extentFreeList" : {
    "num" : 47,
    "totalSize" : 2457366528
  },
  "ok" : 1
}

Let's start with the headline statistic – the fileSize value is how big all the data files used to store this database are, and it's that number which is typically used for billing purposes.

The numbers here are in bytes, and that 2666528768 bytes means 2.48GiB of file space being used. But the other numbers tell a different tale. Look at the storageSize – this is the amount of storage that's been allocated to hold the documents – which comes in at 43438080 bytes, 41MiB. That's quite a difference. There aren't even any large indexes to account for it, as indexSize is only 4210640 bytes, 4MiB.

Crunch Time

What is going on? Remember, we said this was one of our example databases. One of the things we do with it is test operations like bulk writing and removal. Sometimes we create collections, fill them, query them, delete the records and then delete the collection. But MongoDB doesn't give that file space back immediately, or at all; once space has been allocated, the database wisely retains it when it is freed, ready to be used again when needed. It's thinking "If someone's put a million records into me, there's a good chance they'll do it again". That free space is recorded in the extentFreeList, and if we look in there we find the totalSize is 2457366528 bytes, 2.28GiB. This is where the majority of our fileSize is coming from.

So what can we do about it? Running db.repairDatabase() is the best way to recover disk space. The operation blocks access to the database while it runs, so it is not to be undertaken lightly. It effectively runs a compact on every collection, rewriting each collection into a new one, reindexing it and then swapping it back into place, and it also reclaims other space from around the database. Let's do that on our example database, with the calls sketched below for reference, and see how it looks afterwards...
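
A minimal sketch of those calls – db.repairDatabase() for the whole database and, for comparison, the compact command, which rewrites one collection at a time; "mycollection" is just a placeholder name:

// repair the whole database – blocks access while it runs
db.repairDatabase()

// or compact a single collection instead (placeholder collection name)
db.runCommand({ compact: "mycollection" })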

exemplum/19:13:10>db.stats()
{
  "db" : "exemplum",
  "collections" : 10,
  "objects" : 110857,
  "avgObjSize" : 239.9301442398766,
  "dataSize" : 26597936,
  "storageSize" : 41345024,
  "numExtents" : 24,
  "indexes" : 10,
  "indexSize" : 3589264,
  "fileSize" : 117440512,
  "nsSizeMB" : 16,
  "dataFileVersion" : {
    "major" : 4,
    "minor" : 5
  },
  "extentFreeList" : {
    "num" : 0,
    "totalSize" : 0
  },
  "ok" : 1
}

This is a huge improvement. The fileSize is now down to 117440512 bytes – 0.1GiB, or more usefully 112MiB. The extentFreeList now shows no bytes being retained for future allocation. The storageSize has decreased too, only by a few MiB here, but in many databases that saving could be much larger. Like the fileSize, the storageSize only ever grows and doesn't shrink unless action is taken. The only size value in the db.stats() output that changes naturally as records are added and removed is dataSize, because that reflects the number of bytes of data held in the collections. Even so, if you make the content of the stored records smaller, dataSize doesn't shrink either.

Padding Power

That's because the dataSize also includes the padding used to make expanding and contracting records on disk easier. Before MongoDB 2.6, the default for creating this padding was based on a paddingFactor; a new collection starts with a paddingFactor of 1, but updates which grow records can push the factor up towards 2. When a new record is created, the size of the document multiplied by the padding factor determines how much space is allocated for it. Things changed in MongoDB 2.6, where "Power of 2 Sized Allocations" became the default. With this, a record of, say, 76 bytes is allocated 128 bytes of space because that's the next power of 2 up. Using powers of 2 makes it less likely that re-allocations will occur under typical workloads, but if you rarely update the records in a collection, you can turn off usePowerOf2Sizes, use the older "exact fit" method and make better use of storage space.
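
Switching a collection over is done with the collMod command – a quick sketch, with "mycollection" again standing in for one of your own collection names:

// switch an existing collection back to the older "exact fit" allocation
db.runCommand({ collMod: "mycollection", usePowerOf2Sizes: false })

// and back to power-of-2 sized allocations if you change your mind
db.runCommand({ collMod: "mycollection", usePowerOf2Sizes: true })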

If you want to get information about a specific collection, use db.<collection name>.stats(), which will give you information about that collection, including its paddingFactor (which is reported even when power of 2 allocation, rather than "exact fit", is in use); if you want to confirm usePowerOf2Sizes is being used, the userFlags field will be set to 1. The collStats manual page has more on the collection stats.
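
A short sketch of that check, with "mycollection" once more a placeholder:

// per-collection statistics
var s = db.mycollection.stats()
s.paddingFactor  // reported whichever allocation strategy is in use
s.userFlags      // 1 when usePowerOf2Sizes is on, 0 when it is off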

A quick note about both of the .stats() methods: they can take a scale value as a parameter, and many of the size values will be divided by that number and rounded to whole numbers. The scale for KiB values is easy, 1024, but you may or may not remember 1048576 as the scale factor for MiB or 1073741824 for GiB. If you are in the shell, let it do the arithmetic for you and enter 1024*1024 for MiB and 1024*1024*1024 for GiB. Or, if you find yourself using the values in the shell a lot, add

MiB=1024*1024  
GiB=1024*1024*1024  

to your .mongorc.js file and have the values always available.
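
With those defined, scaled stats are a short call away – a small usage sketch, where "mycollection" is a placeholder:

db.stats(MiB)               // database sizes reported in mebibytes
db.mycollection.stats(GiB)  // one collection's sizes in gibibytes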

Putting a Cap On It

There are ways to stop an ever-expanding database though, at least for certain use cases. Say you are only interested in the last 100,000 records in a collection: you can create it as a capped collection and set it to hold at most 100,000 records. When there are 100,000 records in the collection and another is inserted, the oldest record is discarded. The limit on the collection size also means that insertion and query speed stay relatively high.
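
Creating one is a single call – a sketch, where the collection name and the 16MiB byte limit are just example values (a capped collection always needs a maximum size in bytes, with max as an optional extra cap on the record count):

// keep at most 100,000 records, within 16MiB of storage
db.createCollection("recent_events", { capped: true, size: 16*1024*1024, max: 100000 })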

If capping the collection is too blunt an instrument, another automated pruning option is giving records a TTL (time to live) by adding an expiry index to the collection. There are a number of caveats around creating such an index, but it does allow you to, for example, retain records in a collection for an hour, making it ideal for web session data.
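
A minimal sketch of that, assuming a collection called "sessions" with a "createdAt" date field – both placeholder names:

// documents are removed roughly an hour after their createdAt timestamp
// (createdAt must hold a BSON date for the TTL to apply)
db.sessions.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })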

The hints above should help you keep a handle on your space consumption, but they aren't the whole story. Users of MongoHQ's elastic deployments should consult their administration dashboard for the current amount of space consumed. If you have any concerns about how your MongoHQ database is consuming space, just drop us a note and we will be happy to help.