Finding Duplicate Documents in MongoDB
Need to find duplicate documents in your MongoDB database? This article will show you how to find duplicate documents in your existing database using MongoDB's aggregation pipeline.
Finding duplicate values in your database can be difficult, especially if you have millions of documents to look at. MongoDB's aggregation pipeline makes finding duplicate documents easier by allowing you to customize how documents are grouped together and filtered. In other words, MongoDB lets you select fields and group together documents based on your selection in order to find duplicate documents.
The dataset that we will be using is a CSV file containing a list of 550 failed banks compiled by the Federal Deposit Insurance Corporation (FDIC). We've altered the list for this article to include duplicate banks, so our list now includes 559 banks which can be downloaded here.
After downloading the CSV file, let's import the file into our Compose MongoDB deployment. To insert the CSV file, we'll be using mongoimport
to create a database called banks and a collection called list. Make sure to use the credentials for your deployment.
mongoimport --host aws-us-west-2-portal.0.dblayer.com --port 99999 --db banks --collection list --ssl --sslAllowInvalidCertificates -u user -p mypass --type csv --headerline --file banklist.csv
Once you've run mongoimport
in the terminal, you should see that 559 documents have been successfully imported into your database. Now, let's log into our MongoDB shell and look at our documents.
Working with the Bank Data
In the MongoDB shell, our data should look similar to this:
db.list.findOne();
{
"_id" : ObjectId("5900b01b2ce12a2383328e61"),
"Bank Name" : "Seaway Bank and Trust Company",
"City" : "Chicago",
"ST" : "IL",
"CERT" : 19328,
"Acquiring Institution" : "State Bank of Texas",
"Closing Date" : "27-Jan-17",
"Updated Date" : "17-Feb-17"
}
Each document is provided with a unique ObjectId by MongoDB and also a unique CERT
number provided by the FDIC. The CERT number is unique to each branch of a bank, and it's the number we'll use to identify duplicate banks. We won't be looking for duplicate Bank Names, since a bank can have multiple branches that share the same name.
Now, let's start looking for duplicate banks in our dataset using MongoDB's aggregation pipeline operators.
Finding Duplicates with Aggregations
You can find duplicate values within your MongoDB database using the aggregate
method along with the $group
and $match
aggregation pipeline operators. For a closer look at MongoDB's aggregation pipeline operators see the article Aggregations in MongoDB by Example.
Let's start out with how to use the aggregate method in MongoDB. We'll apply the aggregate function to our list
like this:
db.list.aggregate([ ... ]);
The aggregate function takes an array of aggregation operators. The $group
operator is the first one we want to use, and we want to set it up so it groups our documents by the CERT number.
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"}
}
}
]);
We'll need an _id
field first, which is mandatory because it indicates what we are grouping by. In this case, we are grouping together the CERT numbers so $CERT
is used to point to the CERT
field. The key CERT
within _id
can have any name.
Running the aggregation using only the $group
pipeline operator, we'll get something like the following from the MongoDB shell. However, $group doesn't order its output, so your results may vary:
{ "_id" : { "CERT" : 6629 } }
{ "_id" : { "CERT" : 34264 } }
{ "_id" : { "CERT" : 32646 } }
{ "_id" : { "CERT" : 22002 } }
{ "_id" : { "CERT" : 24382 } }
{ "_id" : { "CERT" : 19183 } }
{ "_id" : { "CERT" : 33784 } }
{ "_id" : { "CERT" : 16445 } }
{ "_id" : { "CERT" : 27094 } }
...
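Conceptually, a $group stage with only an _id key reduces the collection to its distinct group keys. A minimal Python sketch over hypothetical sample rows (the _id and CERT values here are made up) shows the same idea:

```python
# Hypothetical stand-ins for documents in the "list" collection.
docs = [
    {"_id": "a1", "CERT": 6629},
    {"_id": "a2", "CERT": 34264},
    {"_id": "a3", "CERT": 6629},   # same CERT as the first row: a duplicate
]

# Grouping by CERT with no accumulators is just "collect the distinct keys".
distinct_certs = {d["CERT"] for d in docs}
print(sorted(distinct_certs))  # [6629, 34264]
```

Note that the duplicate CERT collapses into a single group, which is exactly why $group alone isn't enough: we still need to remember which documents fell into each group.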
In order to view documents that have the same CERT number, we'll need to collect them together and show their IDs. To do that, we can create another field to preserve the IDs as we move through the pipeline. For now, let's name it uniqueIds
, but you can name it whatever you'd like. What we'll do is associate a CERT with all of the MongoDB documents that have the same CERT number in an array. The array will be created with the $addToSet
operator and it will include all the ObjectIds of the documents with the same CERT number using each document's _id
field represented by $_id
. If there is more than one ObjectId associated with a CERT number in the uniqueIds
array, then we've found a duplicate.
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"}
}
}
]);
The result of this query looks like the following:
...
{ "_id" : { "CERT" : 34485 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329007") ] }
{ "_id" : { "CERT" : 34682 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329005") ] }
{ "_id" : { "CERT" : 22130 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329003") ] }
{ "_id" : { "CERT" : 28312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329001") ] }
{ "_id" : { "CERT" : 57017 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332904c") ] }
{ "_id" : { "CERT" : 27197 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328fff") ] }
{ "_id" : { "CERT" : 34979 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332907b") ] }
{ "_id" : { "CERT" : 25231 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffe") ] }
{ "_id" : { "CERT" : 20203 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffd") ] }
{ "_id" : { "CERT" : 35030 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffb") ] }
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ] }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ] }
...
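The accumulation that $addToSet performs can be sketched in Python as one set of document ids per CERT (the ids and CERT values below are hypothetical):

```python
from collections import defaultdict

# Hypothetical documents; two of them share CERT 31293.
docs = [
    {"_id": "a1", "CERT": 31293},
    {"_id": "a2", "CERT": 2303},
    {"_id": "a3", "CERT": 31293},
]

# uniqueIds: {$addToSet: "$_id"} -- collect each group's _ids into a set.
unique_ids = defaultdict(set)
for d in docs:
    unique_ids[d["CERT"]].add(d["_id"])

# Any CERT whose set holds more than one _id is duplicated.
print({cert for cert, ids in unique_ids.items() if len(ids) > 1})  # {31293}
```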
At this point, we could stop and check to see which groupings have more than one unique ObjectId and decide which document we want to keep. However, imagine if we had hundreds or thousands of documents. It would become very tedious to look for all of the groupings with more than one unique ObjectId. So, let's make it a little easier by including another field in our $group
pipeline operator that will count
the documents. The count
field uses the $sum operator, which adds 1 to the group's total for each document in the group. In other words, within a $group stage, $sum returns the sum of the values produced by evaluating its expression against every document that shares the same group key.
This aggregation will look like the following:
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
}
]);
Running this aggregation, we'll get the number of documents that have the same CERT
in the count
field.
...
{ "_id" : { "CERT" : 34485 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329007") ], "count" : 1 }
{ "_id" : { "CERT" : 34682 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329005") ], "count" : 1 }
{ "_id" : { "CERT" : 22130 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329003") ], "count" : 1 }
{ "_id" : { "CERT" : 28312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329001") ], "count" : 1 }
{ "_id" : { "CERT" : 57017 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332904c") ], "count" : 1 }
{ "_id" : { "CERT" : 27197 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328fff") ], "count" : 1 }
{ "_id" : { "CERT" : 34979 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332907b") ], "count" : 1 }
{ "_id" : { "CERT" : 25231 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffe") ], "count" : 1 }
{ "_id" : { "CERT" : 20203 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffd") ], "count" : 1 }
{ "_id" : { "CERT" : 35030 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383328ffb") ], "count" : 1 }
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ], "count" : 2 }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ], "count" : 2 }
...
To get only the groups that have a count
of more than one, we can use the $match pipeline operator to filter our results. Within $match, we'll check the count field for values greater than one using the $gt ("greater than") operator and the number 1. This looks like the following:
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
]);
The result of this query will give us only the grouped documents that have more than one document with the same value.
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ], "count" : 2 }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ], "count" : 2 }
{ "_id" : { "CERT" : 19657 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908b"), ObjectId("5900b01b2ce12a2383328ed0") ], "count" : 2 }
{ "_id" : { "CERT" : 57315 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908f"), ObjectId("5900b01b2ce12a238332908e"), ObjectId("5900b01b2ce12a2383328f7f") ], "count" : 3 }
{ "_id" : { "CERT" : 35312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908a"), ObjectId("5900b01b2ce12a2383328e66") ], "count" : 2 }
{ "_id" : { "CERT" : 28144 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329087"), ObjectId("5900b01b2ce12a2383328ea1") ], "count" : 2 }
{ "_id" : { "CERT" : 8221 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329088"), ObjectId("5900b01b2ce12a2383328fa2") ], "count" : 2 }
{ "_id" : { "CERT" : 34486 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329089"), ObjectId("5900b01b2ce12a2383328f61") ], "count" : 2 }
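The count-then-filter steps map onto a short Python sketch (the CERT values are hypothetical, chosen to mirror the duplicates above):

```python
from collections import Counter

# Hypothetical CERT values pulled from the collection's documents.
certs = [31293, 2303, 31293, 57315, 57315, 57315, 35312]

# count: {$sum: 1} adds 1 per document in each group...
counts = Counter(certs)

# ...and $match with {count: {"$gt": 1}} keeps only the duplicated groups.
duplicates = {cert: n for cert, n in counts.items() if n > 1}
print(duplicates)  # {31293: 2, 57315: 3}
```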
Filtering the results down to CERT groupings with more than one document removes a lot of the noise. In larger datasets, though, you may have many groupings with many more duplicates, so you may want to order them so that the groupings with the most duplicates appear at the top of your results, in descending order. MongoDB provides the $sort pipeline operator to select the fields we want to sort our groupings by. To sort by the count field in descending order, we give it the expression -1, and now our aggregation looks like:
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
},
{$sort: {
count: -1
}
}
]);
Now, our results will be in descending order based on the count
field.
{ "_id" : { "CERT" : 57315 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908f"), ObjectId("5900b01b2ce12a238332908e"), ObjectId("5900b01b2ce12a2383328f7f") ], "count" : 3 }
{ "_id" : { "CERT" : 31293 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908c"), ObjectId("5900b01b2ce12a2383328f23") ], "count" : 2 }
{ "_id" : { "CERT" : 2303 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908d"), ObjectId("5900b01b2ce12a2383328f22") ], "count" : 2 }
{ "_id" : { "CERT" : 19657 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908b"), ObjectId("5900b01b2ce12a2383328ed0") ], "count" : 2 }
{ "_id" : { "CERT" : 35312 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a238332908a"), ObjectId("5900b01b2ce12a2383328e66") ], "count" : 2 }
{ "_id" : { "CERT" : 28144 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329087"), ObjectId("5900b01b2ce12a2383328ea1") ], "count" : 2 }
{ "_id" : { "CERT" : 8221 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329088"), ObjectId("5900b01b2ce12a2383328fa2") ], "count" : 2 }
{ "_id" : { "CERT" : 34486 }, "uniqueIds" : [ ObjectId("5900b01b2ce12a2383329089"), ObjectId("5900b01b2ce12a2383328f61") ], "count" : 2 }
Using MongoDB's aggregate method allows you to group together your documents and find duplicates in just a few lines of code. You can include other fields in the different pipeline operators to refine your search even further; for example, you can group on bank names as well by adding "Bank Name": "$Bank Name" to the _id of your $group pipeline operator.
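For completeness, the extended pipeline (with the bank name added to the group key) can be written out as a plain Python data structure, which is the form a driver such as pymongo would accept via collection.aggregate; running it against the banks database would of course require a live connection, which this sketch does not assume:

```python
# The full pipeline as plain Python dicts; with pymongo you would pass it to
# db.list.aggregate(pipeline) against a live connection.
pipeline = [
    {"$group": {
        # Grouping on both CERT and Bank Name refines the duplicate check.
        "_id": {"CERT": "$CERT", "Bank Name": "$Bank Name"},
        "uniqueIds": {"$addToSet": "$_id"},
        "count": {"$sum": 1},
    }},
    {"$match": {"count": {"$gt": 1}}},
    {"$sort": {"count": -1}},
]
```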
Of course, it would have been better to never import the duplicates in the first place. If we had created a unique index on CERT before importing, the duplicate inserts would have failed with duplicate key errors, so only one document per CERT would have made it in. The problem there, though, is that you don't control which document survives. By deduplicating before importing, or importing and then deduplicating, you stand a better chance of good data quality with a usable audit trail.
Nonetheless, if we're working with data that already exists in our database, MongoDB's aggregation pipeline gives us a way to find duplicate documents quickly and with just a little code. With this knowledge, experiment on your own dataset to find duplicate documents that you may want to keep, merge, or delete altogether.
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.
Attribution: Chris Barbalis