Shipwrecked! A MongoDB Data Recovery Tale


It was September 19, International Talk Like a Pirate Day, when a friend-of-a-friend (we'll call him Cap'n Purplebeard) hailed our jolly crew with a tale of woe. The Cap'n had not been running MongoDB with us, but at a budget VPS host. The database had worked fine there until the host suffered a hardware crash. And as is all too usual in these stories, the self-hosted platform had no usable backups less than about six weeks old.

Close to half of the Cap'n's database contents were newer than that and therefore weren't in the backup. File recovery on the crashed server yielded a number of MongoDB data and namespace files in various states of disrepair. Some looked mostly intact, while others were overwritten with parts of unrelated files or otherwise corrupted.

Cap'n Purplebeard asked if we could salvage any usable records from the debris. Our mission was set: use all our nautical tricks to snatch back the Cap'n's data from the scurvy clutches of Davy Jones! Arrrr!!!

Heading to Sea: getting the data recovery started

When it comes to MongoDB corruption, the obvious first step is to run mongod --repair, which seemed hopeless here. We tried it anyway, but of course this wouldn't be an interesting blog post if that had done any good. The files were too scrambled, and mongod simply logged some error messages and shut itself off. We had to pursue more drastic measures and wrassle the recovered file remnants with our bare hands.

Our co-skipper Ben Wyrosdick got us under way by setting up a chat channel and finding a few old MongoDB Inc. presentations about the structure of MongoDB data, discussing MongoDB internals and storage and data safety.

These explained the separation of namespaces (.ns files) and data extents (.0, .1, etc.) and described what each type of file contains. We ended up not using the presentations' contents much per se, but they helped us get our "sea legs" in understanding what we had reeled in. At crewmember Nick Stott's suggestion, we started by extracting intact documents from the data extents directly, rather than face the more intricate challenges of interpreting the namespace files, journals, B-tree indices within the data extents, etc. right away. Our "bottom-up" BSON-only approach (as we'll see) was successful enough that we ultimately never dug into the .ns files at all.

Fit the First: extracting BSON from the data extents

Our first experiment simply scanned one of the data extent files for valid BSON documents and extracted them to see what they held. As you might know, BSON is the data format MongoDB uses for storing and transmitting documents. You're probably familiar with JSON. BSON is similar but more efficient, since it is a binary format. On the other hand, unlike JSON, it's not human-readable--you need parsing software, which exists for many languages but is not always very reliable.

PyMongo, the MongoDB Python driver, comes with a BSON library that we've used in some other projects, so it was the first thing we reached for when this situation arose. Each BSON record starts with a 4-byte header giving the record length as a 32-bit little-endian binary integer. So scanning for documents was just a matter of looking at every byte offset in the file, reading 4 bytes and seeing if they represented a reasonable length N, and if so, reading the N bytes starting at that offset and checking if they formed a syntactically valid BSON document. We decided for the moment not to try to find usable data fields in the interior of documents that had been partially trashed. That would have been more complicated, and we ended up not needing it.
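The scanning loop described above can be sketched in a few lines of Python. This is a minimal sketch rather than the script from the recovery: the name candidate_documents, the min_len and max_len bounds, and the validate callback are all illustrative assumptions. The only built-in check is that a candidate ends in the NUL byte that terminates every BSON document; for a real run you would pass a full BSON parser (such as the one that ships with PyMongo) as validate.

```python
import struct


def candidate_documents(buf, min_len=5, max_len=16 * 1024 * 1024, validate=None):
    """Yield (offset, raw_bytes) for each byte offset that could start a BSON document.

    min_len, max_len and validate are assumptions made for this sketch.
    """
    for offset in range(len(buf) - 3):
        # The first 4 bytes of a BSON document are its total length (int32, little-endian).
        (length,) = struct.unpack_from("<i", buf, offset)
        if length < min_len or length > max_len:
            continue
        end = offset + length
        if end > len(buf):
            continue
        # Every BSON document ends with a single 0x00 byte.
        if buf[end - 1] != 0:
            continue
        raw = bytes(buf[offset:end])
        # Hand the candidate to a real parser if one was supplied.
        if validate is None or validate(raw):
            yield offset, raw


# Hand-built BSON for {"a": 1}: length, int32 type tag, key, value, terminator.
doc = b"\x0c\x00\x00\x00" + b"\x10" + b"a\x00" + b"\x01\x00\x00\x00" + b"\x00"
wreckage = b"\xff" * 7 + doc + b"\xff" * 5
hits = list(candidate_documents(wreckage))
```

Without a strict validate callback, a cheap filter like this one lets through many false positives on real data, which is exactly the problem described next.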

We implemented the above scanning scheme and pulled out a few documents (in JSON format for easy viewing) that Cap'n Purplebeard recognized as coming from his database, so he told us we were onto something. But unfortunately, we also got a lot of hits that were obviously spurious because they came from overlapping byte ranges in the file. There were thousands of these, too many to inspect by hand. We needed a better way to recognize documents of interest.

Fit the Second: recognizing complete BSON documents

Cap'n Purplebeard had already told us that his database had around 20 collections. He now gave us a sample document with a complicated nested structure, which looked tricky to recognize because of all the possible combinations and corner cases. Here, the Cap'n's out-of-date backup came in handy. We loaded it into a Mongo server and ran a simple client that sampled each collection and pulled the top-level attribute names from the sampled documents. This gave us a crude schema for each collection (see /articles/schema-less-is-usually-a-lie/), and that helped a lot in this situation.
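Here is a rough sketch of how such a crude schema could be derived. It assumes the sampled documents have already been fetched from the server as plain Python dicts (for example from a driver cursor), and it assumes that "schema" means the set of top-level field names present in every sample. The function names and the sample data are made up for illustration.

```python
def crude_schema(sampled_docs):
    """Return the top-level field names that appear in every sampled document."""
    common = None
    for doc in sampled_docs:
        keys = set(doc)
        common = keys if common is None else common & keys
    return common or set()


def build_schemas(samples_by_collection):
    """Map each collection name to its crude schema."""
    return {
        name: crude_schema(docs)
        for name, docs in samples_by_collection.items()
    }


# Hypothetical samples standing in for documents pulled from the old backup.
samples = {
    "crew": [
        {"_id": 1, "name": "Anne", "rank": "bosun"},
        {"_id": 2, "name": "Jack", "rank": "mate", "parrot": True},
    ],
    "ships": [
        {"_id": 10, "hull": "oak", "guns": 12},
    ],
}
schemas = build_schemas(samples)
```

Using the intersection keeps optional fields (like parrot above) out of the schema, so documents that lack them are still recognized.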

We now modified our scanning script to check each surfaced document for the presence of all the fields in each collection's schema, logging a warning message if more than one schema matched, or throwing away the document if no schema matched. We ran the modified script on the first of the extent files, pulling out a few hundred records. We never triggered the message about matching more than one schema (then or later), so the schemas were enough to disambiguate the records in every case. The schema matches also let us conveniently write separate output files for each of the original database's collections. Now it was just a matter of a little more tweaking and then processing the rest of the files... but uh-oh! That would have been too easy.
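The matching step can be sketched as a subset test. This is an illustration under assumptions, not the recovery script itself: schemas are taken to be sets of required top-level field names, and the names matching_collections and classify, along with the example schemas, are hypothetical.

```python
import logging


def matching_collections(doc, schemas):
    """Return the names of all collections whose required fields all appear in doc."""
    fields = set(doc)
    return [name for name, required in schemas.items() if required <= fields]


def classify(doc, schemas):
    """Return the collection a document belongs to, or None to discard it."""
    matches = matching_collections(doc, schemas)
    if not matches:
        return None  # no schema matched: treat the hit as spurious
    if len(matches) > 1:
        logging.warning("document matches several schemas: %s", matches)
    return matches[0]


# Hypothetical schemas: required top-level field names per collection.
example_schemas = {
    "crew": {"_id", "name", "rank"},
    "ships": {"_id", "hull", "guns"},
}
```

For example, classify({"_id": 7, "name": "Mary", "rank": "gunner"}, example_schemas) picks "crew", while a document holding only an _id field is thrown away. The return value can then select the per-collection output file.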

We didn't notice at first that some of the BSON records in the file were so hornswaggled that they had made the PyMongo BSON library itself "walk the plank", crashing the Python interpreter with a segmentation fault after processing only about a third of the file. Pulling the latest PyMongo commit from GitHub got the script a few percent further before it crashed in about the same way as before. We'd have to do something about this crash before we could make any more progress.

Fit the Third: working around a BSON parsing bug

The day was wearing on, and the prospects of debugging the parser, or converting our whole script to another language with its own possibly-buggy BSON parser, both sounded dreary. PyMongo uses a C extension to parse BSON faster, and after a few false starts, we found that its setup script had a build option to disable the extension and parse with pure Python instead. We rebuilt the driver with the --no_ext option, as the C extension was the obvious suspect for harboring the segfault. Then we restarted the scanning script from the top.
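After a rebuild like this, it is worth confirming which parser is actually being loaded. The sketch below relies on the has_c() helper in PyMongo's bson package; the wrapper name c_extension_in_use is our own, and it returns None when that helper is not available (for instance if PyMongo is not installed, or a different package named bson is on the path).

```python
def c_extension_in_use():
    """Return True/False for PyMongo's bson C extension, or None if it can't be determined."""
    try:
        import bson
    except ImportError:
        return None
    has_c = getattr(bson, "has_c", None)
    if has_c is None:
        # Some other package called "bson" is installed, not PyMongo's.
        return None
    return bool(has_c())


status = c_extension_in_use()
```

A value of False means the pure Python parser is doing the work, which is what we wanted here.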

The pure Python parser got us past the crash at the cost of a slowdown in processing speed. It took about 2 hours to scan all the data files on a developer workstation, and we got a lot more records that way. But it was also apparent that we'd need several more runs to tweak parameters and try different things. We could optimize (maybe with a C++ library, at the risk of more crashes and debugging), we could accept long waits between runs, or we could bring on more hardware firepower to speed things up. We opted for the last of these.

Pieces of Eight: parallel MongoDB document extraction

MongoHQ's proud beauties, the SSD replica sets, sail aboard our fastest vessels--servers powered by dual Intel E5-2670 octocore processors for a total of 16 physical CPU cores or (because of Intel hyperthreading) 32 logical cores. We had one of these men-o'-war "in harbor" and unfreighted, so we wrote another script to chop the input files into 60 or so overlapping pieces small enough to individually process in a minute or two each. The overlap was necessary to avoid splitting an intact record across the boundary between two pieces without leaving a complete copy anywhere. It meant that a record within the overlap would be duplicated across the pieces, but that is OK: both copies carry the same _id, so MongoDB's unique _id index keeps only one of them when the records are inserted.
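The chopping step can be sketched as computing byte ranges over a file. This is a minimal illustration with made-up names and numbers, not our actual script. The key assumption is that the overlap is at least as large as the biggest record you expect to recover; under that assumption every intact record lies wholly inside at least one piece.

```python
def overlapping_pieces(total_size, piece_size, overlap):
    """Return (start, end) byte ranges covering [0, total_size) with the given overlap."""
    if not 0 <= overlap < piece_size:
        raise ValueError("overlap must be non-negative and smaller than piece_size")
    step = piece_size - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + piece_size, total_size)
        pieces.append((start, end))
        if end >= total_size:
            break
        start += step
    return pieces


# Toy numbers: a 100-byte file cut into 40-byte pieces that overlap by 10 bytes.
pieces = overlapping_pieces(100, 40, 10)
```

Each range can then be written to its own file, or handed to the scanner as an offset and length, so that the pieces can be processed independently.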

We then used the handy and under-appreciated GNU Parallel program to run copies of our scanning script on all the pieces independently. GNU Parallel automatically schedules and launches new processes to maximally use the available cores. That spared us writing our own parallel execution wrapper with task queues and worker threads, the usual approach to such problems. The output of the "time" command looked like this:

real  4m29.563s  
user  109m18.146s  
sys  5m51.534s  

You can see that we got almost two nominal CPU hours in just 4.5 minutes of elapsed time. This is slightly misleading since it counts the hyperthreaded contexts as if they were all real CPUs when (by our other experiments) hyperthreading only gives about a 20% speedup for this type of program; i.e., each logical core is equivalent to about 60% of a physical core. The overall speedup from 32 threads running flat out is around 20x a single thread, for these independent processes with no cache or lock contention to speak of. On the other hand, at the tail end of each run as the last few segments were processed, not all the cores were in use, leaving a few compute cycles on the table.
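The arithmetic behind those figures can be checked with a few lines of Python. The timings are the ones shown above; the 0.6 factor is the rough physical-core equivalent of one logical core quoted in the text, and the variable names are ours. Dividing CPU time by elapsed time is only a rough measure of how many logical cores were busy on average.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '4m29.563s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


real = to_seconds("4m29.563s")
user = to_seconds("109m18.146s")
sys_time = to_seconds("5m51.534s")

# Nominal CPU time consumed, in hours (about 1.9).
cpu_hours = (user + sys_time) / 3600

# Average number of logical cores kept busy over the run (about 25.6 of 32).
avg_busy_logical_cores = (user + sys_time) / real

# Estimated speedup over one thread if all 32 logical cores ran flat out,
# counting each logical core as roughly 0.6 of a physical core.
flat_out_speedup = 32 * 0.6
```

The gap between the average of about 25.6 busy logical cores and the full 32 reflects the idle cores at the tail end of the run.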

In any case, the speedup we got was enough that we could make 5-minute runs "interactively" while staying in conversation about what to try next, so we didn't worry about optimizing further. A two-hour wait between runs would have turned a real-time collaboration into a much slower one, probably done by email over multiple days.

If we hadn't had the 16-core server available or if the dataset had been much larger, our best bet might have been to shoot a few doubloons to our mates at Amazon Web Services to hire a formation of high-CPU instances from their EC2 fleet for the evening. Distributing a task like this across multiple machines is a bit more complicated than having everything on a single SMP machine, but GNU Parallel can do it with the right options set.

Sailing Home: finishing the data recovery

After a few rounds of manual schema editing by Cap'n Purplebeard between runs, and a few other fixes on our part, we delivered around 600k JSON records for him to check and we called it a night. The next day he told us that we had successfully recovered about 96% of his data, so he was very happy. He modified the schema file one more time and we did another run that picked up a small collection added after the backup was taken. He was able to reconstruct the most critical parts of the missing 4% by other, more tedious means, and that let him get his site back on the air. We never had to mess with document fragments or B-tree indices. His final recovery rate after merging his existing backups was 99.87%, which we consider phenomenal considering how extensively the files had been damaged.

We later learned that we also unintentionally created a few "zombies", i.e. we restored some documents that Cap'n Purplebeard's application had created and later deleted during its normal operations before the crash. We had thought from earlier discussions that these deletions weren't likely to have occurred. But Cap'n Purplebeard was able to find and re-delete the zombie docs without much trouble.

After we sent the recovered docs, Cap'n Purplebeard's team used 1 developer-day to get their site back to a "usable" state, and 2.5 additional developer-days to finish restoring all the collections. They have received no user complaints since the recovery, so they consider it to be a success. This is hugely satisfying for us. But we had better confess that there was a lot of good luck all around: we're surprised that the recovery effort and user disruption weren't larger, and that the recovery rate was so high, especially with such low-tech methods. It may have been a case of fortune favoring the bold.

During the salvage operation the Cap'n sent a cargo of beer and munchies to both of our offices, which we enjoyed for weeks afterwards. And of course, he's moving his MongoDB hosting over to MongoHQ. Thanks, Cap'n!

Epilogue: source code release on GitHub

We hope none of you are ever in a situation like Cap'n Purplebeard's. But in case you're unlucky, we've released the scripts we used on GitHub. The repo has a made-up schema file since we have to keep Cap'n Purplebeard's real one private. Given the scripts' one-off nature, they aren't well-parametrized or packaged. You'll probably have to adapt them to your own purposes anyway, if you have to use them at all.

Until then, happy voyages!
