Finding Duplicate Documents in MongoDB

Published

Need to find duplicate documents in your MongoDB database? This article will show you how to find duplicate documents in your existing database using MongoDB's aggregation pipeline.

Finding duplicate values in your database can be difficult, especially if you have millions of documents to look at. MongoDB's aggregation pipeline makes finding duplicate documents easier by allowing you to customize how documents are grouped together and filtered. In other words, MongoDB lets you select fields and group together documents based on your selection in order to find duplicate documents.

The dataset that we will be using is a CSV file containing a list of 550 failed banks compiled by the Federal Deposit Insurance Corporation (FDIC). We've altered the list for this article to include duplicate banks, so our list now includes 559 banks which can be downloaded here.

After downloading the CSV file, let's import the file into our Compose MongoDB deployment. To insert the CSV file, we'll be using mongoimport to create a database called banks and a collection called list. Make sure to use the credentials for your deployment.

mongoimport --host aws-us-west-2-portal.0.dblayer.com --port 99999 --db banks --collection list --ssl --sslAllowInvalidCertificates -u user -p mypass --type csv --headerline --file banklist.csv  

One you've run mongoimport in the terminal, you should see that 559 documents have been successfully imported into your database. Now, let's log into our MongoDB shell and look at our documents.

Working with the Bank Data

In the MongoDB shell, our data should look similar to this:

db.list.findOne();  
{
    "_id" : ObjectId("5900b01b2ce12a2383328e61"),
    "Bank Name" : "Seaway Bank and Trust Company",
    "City" : "Chicago",
    "ST" : "IL",
    "CERT" : 19328,
    "Acquiring Institution" : "State Bank of Texas",
    "Closing Date" : "27-Jan-17",
    "Updated Date" : "17-Feb-17"
}

Each document is provided with a unique ObjectId by MongoDB and also a unique CERT number provided by the FDIC. The CERT number is unique to each branch of a bank. The CERT number is the number that we'll be using to identify duplicate banks. We won't be looking for duplicate Bank Names since a bank can have multiple branches and have the same name.

Now, let's start looking for duplicate banks in our dataset using MongoDB's aggregation pipeline operators.

Finding Duplicates with Aggregations

You can find duplicate values within your MongoDB database using the aggregate method along with the $group and $match aggregation pipeline operators. For a closer look at MongoDB's aggregation pipeline operators see the article Aggregations in MongoDB by Example.

Let's start out with how to use the aggregate method in MongoDB. We'll apply the aggregate function to our list like this:

db.list.aggregate([ ... ]);  

The aggregate function takes an array of aggregation operators. The $group operator is the first one we want to use, and we want to set it up so it groups our documents by the CERT number.

db.list.aggregate([  
    {$group: { 
        _id: {CERT: "$CERT"} 
        } 
    }
]);

We'll need an _id field first, which is mandatory because it indicates what we are grouping by. In this case, we are grouping together the CERT numbers so $CERT is used to point to the CERT field. The key CERT within _id can have any name.

Running the aggregation using only the $group pipeline operator, we'll get something like the following below from the MongoDB shell. However, the $group pipeline operator doesn't order documents so you may get varying results:

{ "_id" : { "CERT" : 6629 } }
{ "_id" : { "CERT" : 34264 } }
{ "_id" : { "CERT" : 32646 } }
{ "_id" : { "CERT" : 22002 } }
{ "_id" : { "CERT" : 24382 } }
{ "_id" : { "CERT" : 19183 } }
{ "_id" : { "CERT" : 33784 } }
{ "_id" : { "CERT" : 16445 } }
{ "_id" : { "CERT" : 27094 } }
...

In order to view documents that have the same CERT number, we'll need to collect them together and show their IDs. To do that, we can create another field to preserve the IDs as we move through the pipeline. For now, let's name it uniqueIds, but you can name it whatever you'd like. What we'll do is associate a CERT with all of the MongoDB documents that have the same CERT number in an array. The array will be created with the $addToSet operator and it will include all the ObjectIds of the documents with the same CERT number using each document's _id field represented by $_id. If there is more than one ObjectId associated with a CERT number in the uniqueIds array, then we've found a duplicate.

db.list.aggregate([  
    {$group: {
        _id: {CERT: "$CERT"},
        uniqueIds: {$addToSet: "$_id"}
        }
    }
]);

The result of this query looks like the following:

...
{ "_id" : { "CERT" : 34485 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329007") ] }
{ "_id" : { "CERT" : 34682 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329005") ] }
{ "_id" : { "CERT" : 22130 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329003") ] }
{ "_id" : { "CERT" : 28312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329001") ] }
{ "_id" : { "CERT" : 57017 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332904c") ] }
{ "_id" : { "CERT" : 27197 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328fff") ] }
{ "_id" : { "CERT" : 34979 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332907b") ] }
{ "_id" : { "CERT" : 25231 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffe") ] }
{ "_id" : { "CERT" : 20203 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffd") ] }
{ "_id" : { "CERT" : 35030 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffb") ] }
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ] }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ] }
...

At this point, we could stop and check to see which groupings have more than one unique ObjectId and decide which document we want to keep. However, imagine if we had hundreds or thousands of documents. It would become very tedious to look for all of the groupings with more than one unique ObjectId. So, let's make it a little easier by including another field in our $group pipeline operator that will count the documents. The count field uses the $sum operator which adds the expression 1 to the total for this group for each document in the group. When used in the $group stage, $sum returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key.

This aggregation will look like the following:

db.list.aggregate([  
    {$group: {
        _id: {CERT: "$CERT"},
        uniqueIds: {$addToSet: "$_id"},
        count: {$sum: 1}
        }
    }
]);

Running this aggregation, we'll get the number of documents that have the same CERT in the count field.

...
{ "_id" : { "CERT" : 34485 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329007") ], "count" : 1 }
{ "_id" : { "CERT" : 34682 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329005") ], "count" : 1 }
{ "_id" : { "CERT" : 22130 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329003") ], "count" : 1 }
{ "_id" : { "CERT" : 28312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329001") ], "count" : 1 }
{ "_id" : { "CERT" : 57017 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332904c") ], "count" : 1 }
{ "_id" : { "CERT" : 27197 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328fff") ], "count" : 1 }
{ "_id" : { "CERT" : 34979 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332907b") ], "count" : 1 }
{ "_id" : { "CERT" : 25231 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffe") ], "count" : 1 }
{ "_id" : { "CERT" : 20203 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffd") ], "count" : 1 }
{ "_id" : { "CERT" : 35030 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffb") ], "count" : 1 }
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ], "count" : 2 }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ], "count" : 2 }
...

To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match pipeline operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1. This looks like the following:

 db.list.aggregate([
    {$group: {
        _id: {CERT: "$CERT"},
        uniqueIds: {$addToSet: "$_id"},
        count: {$sum: 1}
        }
    },
    {$match: { 
        count: {"$gt": 1}
        }
    }
]);

The result of this query will give us only the grouped documents that have more than one document with the same value.

{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ], "count" : 2 }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ], "count" : 2 }
{ "_id" : { "CERT" : 19657 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908b"), ObjectId("5900b01b2ce12a2383328ed0") ], "count" : 2 }
{ "_id" : { "CERT" : 57315 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908f"), ObjectId("5900b01b2ce12a238332908e"), ObjectId("5900b01b2ce12a2383328f7f") ], "count" : 3 }
{ "_id" : { "CERT" : 35312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908a"), ObjectId("5900b01b2ce12a2383328e66") ], "count" : 2 }
{ "_id" : { "CERT" : 28144 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329087"), ObjectId("5900b01b2ce12a2383328ea1") ], "count" : 2 }
{ "_id" : { "CERT" : 8221 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329088"), ObjectId("5900b01b2ce12a2383328fa2") ], "count" : 2 }
{ "_id" : { "CERT" : 34486 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329089"), ObjectId("5900b01b2ce12a2383328f61") ], "count" : 2 } 

The filtering of the results to include CERT groupings with more than one document allows for a lot of the noise to be removed. In some cases, you may have more groupings with a lot more duplicates. Therefore, you may want to order them so that the groupings with the most duplicates appear at the top of your results, or in descending order. MongoDB provides us with the $sort pipeline operator to select the fields that we want to sort our groupings by. To sort by the count field in descending order, we add the expression -1, and now our aggregation will look like:

 db.list.aggregate([
    {$group: {
        _id: {CERT: "$CERT"},
        uniqueIds: {$addToSet: "$_id"},
        count: {$sum: 1}
        }
    },
    {$match: { 
        count: {"$gt": 1}
        }
    },
    {$sort: {
        count: -1
        }
    }
]);

Now, our results will be in descending order based on the count field.

{ "_id" : { "CERT" : 57315 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908f"), ObjectId("5900b01b2ce12a238332908e"), ObjectId("5900b01b2ce12a2383328f7f") ], "count" : 3 }
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ], "count" : 2 }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ], "count" : 2 }
{ "_id" : { "CERT" : 19657 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908b"), ObjectId("5900b01b2ce12a2383328ed0") ], "count" : 2 }
{ "_id" : { "CERT" : 35312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908a"), ObjectId("5900b01b2ce12a2383328e66") ], "count" : 2 }
{ "_id" : { "CERT" : 28144 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329087"), ObjectId("5900b01b2ce12a2383328ea1") ], "count" : 2 }
{ "_id" : { "CERT" : 8221 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329088"), ObjectId("5900b01b2ce12a2383328fa2") ], "count" : 2 }
{ "_id" : { "CERT" : 34486 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329089"), ObjectId("5900b01b2ce12a2383328f61") ], "count" : 2 }

Using MongoDB's aggregate method allows you to group together your documents and find duplicates in just a few lines of code. You can include other fields into the different pipeline operators to refine your search even further; for example, you can add the name of the banks to your groups adding just the $Bank Name field to your $group pipeline operator.

Of course, it would have been better to never import the duplicates in the first place. If we created a unique index on CERT before importing, then one of the duplicates would be dropped automatically. The problem there though is you don't know which document will be dropped. By deduplicating before importing or importing and deduplicating, you stand a good chance of better data quality with a usable audit trail.

Nonetheless, if we're working with data that already exists in our database then using MongoDB's aggregation pipeline gives us with a way to find duplicate documents quickly and with just a little code involved. With this knowledge, experiment on your own dataset to find duplicate documents that you may want to keep, merge, or delete altogether.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

attributionChris Barbalis

Abdullah Alger
Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA. Love this article? Head over to Abdullah Alger’s author page and keep reading.