I am learning MongoDB using Python with Tornado. I have a MongoDB collection; when I do
db.cal.find()
{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015",
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015",
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015",
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015",
        "23-10-2015",
        "27-10-2015"
    ]
}
I need an operation to remove the duplicate values from Booked, NA, AOs, and AOr. Finally it should be
{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015"
    ]
}
How do I achieve this in mongodb?
Working solution
I have created a working solution in JavaScript, which can be run in the mongo shell:
var codes = ["AOs", "Booked", "NA", "AOr"]
// Use bulk operations for efficiency
var bulk = db.dupes.initializeUnorderedBulkOp()
db.dupes.find().forEach(
    function(doc) {
        // Needed to prevent unnecessary operations
        var changed = false
        codes.forEach(
            function(code) {
                var values = doc[code]
                var uniq = []
                for (var i = 0; i < values.length; i++) {
                    // If the current value is not yet in "uniq",
                    // it is unique so far and gets added
                    if (uniq.indexOf(values[i]) == -1) {
                        uniq.push(values[i])
                    }
                }
                doc[code] = uniq
                if (uniq.length < values.length) {
                    changed = true
                }
            }
        )
        // Update the document only if something was changed
        if (changed) {
            bulk.find({"_id": doc._id}).updateOne(doc)
        }
    }
)
// Apply all changes
bulk.execute()
Resulting document with your sample input:
replset:PRIMARY> db.dupes.find().pretty()
{
    "_id" : ObjectId("567931aefefcd72d0523777b"),
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs" : [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked" : [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA" : [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr" : [
        "23-10-2015",
        "27-10-2015"
    ]
}
Using indices with dropDups
This simply does not work. First, as of version 3.0, this option no longer exists. Since 3.2 has been released, we should find a portable way.
Second, even with dropDups, the documentation clearly states that:
dropDups boolean : MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key.
So if there were another document with the same values in one of the billing codes as a previous one, the whole document would be deleted.
You can't use the "dropDups" syntax here because it was deprecated as of MongoDB 2.6, removed in MongoDB 3.0, and simply will not work.
To remove the duplicates from each list, you need to use the set class in Python.
import pymongo

fields = ['Booked', 'NA', 'AOs', 'AOr']
client = pymongo.MongoClient()
db = client.test
collection = db.cal
bulk = collection.initialize_ordered_bulk_op()
count = 0
for document in collection.find():
    update = dict(zip(fields, [list(set(document[field])) for field in fields]))
    bulk.find({'_id': document['_id']}).update_one({'$set': update})
    count = count + 1
    if count % 200 == 0:
        bulk.execute()
        bulk = collection.initialize_ordered_bulk_op()

# Execute the remaining operations (executing an empty bulk raises an error)
if count % 200 != 0:
    bulk.execute()
MongoDB 3.2 deprecates Bulk() and its associated methods and provides the .bulkWrite() method instead, available in PyMongo 3.2 as bulk_write(). The first thing to do when using this method is to import the UpdateOne class.
from pymongo import UpdateOne

requests = []  # list of write operations
for document in collection.find():
    update = dict(zip(fields, [list(set(document[field])) for field in fields]))
    requests.append(UpdateOne({'_id': document['_id']}, {'$set': update}))
collection.bulk_write(requests)
Both approaches give the same, expected result:
{'AOr': ['27-10-2015', '23-10-2015'],
'AOs': ['15-10-2015', '14-10-2015', '18-10-2015'],
'Booked': ['7-10-2015', '5-10-2015', '8-10-2015'],
'NA': ['1-10-2015', '4-10-2015', '3-10-2015', '2-10-2015'],
'Period': '10-2015',
'Pid': '5652f92761be0b14889d9854',
'Registration': 'TN 56 HD 6766',
'Vid': '56543ed261be0b0a60a896c9',
'_id': ObjectId('567f808fc6e11b467e59330f')}
Have you tried distinct()?
Link: https://docs.mongodb.org/v3.0/reference/method/db.collection.distinct/
Specify Query with distinct
The following example returns the distinct values for the field sku, embedded in the item field, from the documents whose dept is equal to "A":
db.inventory.distinct( "item.sku", { dept: "A" } )
The method returns the following array of distinct sku values:
[ "111", "333" ]
Assuming that you want to remove duplicate dates from the collection, you can add a unique index with the dropDups: true option:
db.bill_codes.ensureIndex({"fieldName":1}, {unique: true, dropDups: true})
For more reference:
db.collection.ensureIndex() - MongoDB Manual 3.0
Note: Back up your database first in case it doesn't do exactly as you're expecting.
Related
I have the following mongoclient query:
db = db.getSiblingDB("test-db");
hosts = db.getCollection("test-collection")
db.hosts.aggregate([
{$match: {"ip_str": {$in: ["52.217.105.116"]}}}
]);
Which outputs this:
{
    "_id" : ObjectId("..."),
    "ip_str" : "52.217.105.116",
    "data" : [
        {"ssl" : {"cert" : {"expired" : "False"}}}
    ]
}
I'm trying to build the query so it returns a boolean True or False depending on the value of the ssl.cert.expired field. I'm not quite sure how to do this though. I've had a look into the $lookup and $where operators, but am not overly familiar with querying nested objects in Mongo yet.
Since data is an array, to get the (first) element of the nested expired field you should use $arrayElemAt and provide the index 0 to pick the first element.
{
    $project: {
        ip_str: 1,
        expired: {
            $arrayElemAt: ["$data.ssl.cert.expired", 0]
        }
    }
}
Demo @ Mongo Playground
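In pymongo, the same projection can be appended after the $match stage. A sketch, using the collection and field names from the question:

pipeline = [
    {"$match": {"ip_str": {"$in": ["52.217.105.116"]}}},
    {"$project": {
        "ip_str": 1,
        # Take the first element of the nested array path
        "expired": {"$arrayElemAt": ["$data.ssl.cert.expired", 0]},
    }},
]
for doc in db["test-collection"].aggregate(pipeline):
    print(doc)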
I'm trying to find the max value of a field from a number of documents and want the output to not only reflect the max value of the field but also the total count of documents that the aggregate query will retrieve.
I'm able to retrieve the "wait" field with the max value that I want with the query below, but am stuck on how to get the count of all the documents that satisfy the $match stage.
from datetime import datetime

db = mongo_client[_MONGO_COLLECTION]
cursor = db.aggregate([
    {"$match": {"owner": {"$exists": False}}},
    {
        "$project": {
            # Wait time in seconds since the document was created
            "wait": {
                "$divide": [{"$subtract": [datetime.now(), "$creationDate"]}, 1000],
            }
        }
    },
    {"$sort": {"wait": -1}},
    {"$limit": 1},
])
for x in cursor:
    print(x)
You can materialize the cursor into a list and count it client-side (the cursor returned by aggregate has no count() method):
results = list(cursor)
print(len(results))
print(results)
or
you can add a $count pipeline stage as below:
{
    "$count": "count"  // the name of the count field
}
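Since the question asks for both the max wait and the total count, a single $group stage can return both in one pass. A minimal sketch, reusing the $match and $project stages from above:

from datetime import datetime

pipeline = [
    {"$match": {"owner": {"$exists": False}}},
    {"$project": {
        "wait": {"$divide": [{"$subtract": [datetime.now(), "$creationDate"]}, 1000]},
    }},
    # Collapse everything into one result holding the max wait and the count
    {"$group": {
        "_id": None,
        "max_wait": {"$max": "$wait"},
        "count": {"$sum": 1},
    }},
]
for x in db.aggregate(pipeline):
    print(x)  # e.g. {'_id': None, 'max_wait': 123.4, 'count': 42}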
I will be getting the data below frequently:
item_1 = { "ip" : "66.70.175.192", "domain" : null, "date_downloaded" : "2017:08:23 12:25:05", "scanned_date_with_port" : [ { "scanned_date" : "2017:08:22 04:00:03", "port" : 25 }, { "scanned_date" : "2017:08:22 04:00:03", "port" : 110 } ], "ports" : [ 25, 110 ] }
How can I save data into pymongo in the below structure:
{"ip_1" : {"port" : [scanned_date_1, scanned_date_2]}, {"port_2" : [scanned_date_1, scanned_date_2] }, "domain_name" : ["domain"] }
{"ip_2" : {"port" : [scanned_date_1, scanned_date_2]}, {"port_2" : [scanned_date_1, scanned_date_2] }, "domain_name" : ["domain"] }
Whenever a new IP comes in: if it already exists, I need to append to it, else add a new entry. If the port already exists for that IP, append the scanned_date to it, else add that port with its scanned_date.
How can I do it efficiently? There will be massive amounts of data to loop over.
for item in all_items:
Each "item" will have the above structure of item_1.
What you could do is change your data structure to unify how new IPs and IPs already in the DB are treated: you ask for a given IP and a port, get back the structure (either partially populated or empty), and always append the new data, whether to an existing list or to an empty one. You would need something like a factory for this.
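A minimal sketch of that idea (the helper and field names are hypothetical, not from the question), using an upsert so new and existing IPs go through the same code path:

def record_scan(collection, ip, port, scanned_date, domain):
    # Hypothetical helper: one code path for both new and existing IPs.
    # $push appends the scanned date to the port's list; upsert=True
    # creates the document (and the list) if it does not exist yet.
    collection.update_one(
        {"_id": ip},
        {
            "$push": {str(port): scanned_date},
            "$addToSet": {"domain_name": domain},
        },
        upsert=True,
    )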
Use the $push update operator to append to an array. Here's a complete example:
from pymongo import MongoClient
import pprint

client = MongoClient()
db = client.test
collection = db.collection
collection.delete_many({})
collection.insert_many([
    {"_id": "ip_1", "port": [1, 2], "port_2": [1, 2], "domain_name": "domain"},
    {"_id": "ip_2", "port": [1, 2], "port_2": [1, 2], "domain_name": "domain"},
])

# A new request comes in with address "ip_2", port "port_2", timestamp "3".
collection.update_one(
    {"_id": "ip_2"},
    {"$push": {"port_2": 3}},
)

pprint.pprint(collection.find_one("ip_2"))
Finally I got the solution.
bulkop = self.coll_historical_data.initialize_ordered_bulk_op()
for rs in resultset:
    ip = rs["ip"]
    scanned_date_with_port = rs["scanned_date_with_port"]
    domain = rs["domain"]
    for data in scanned_date_with_port:
        scanned_date = data["scanned_date"]
        port = data["port"]
        # Insert if not found, else update
        retval = bulkop.find({"ip": ip}).upsert().update(
            {"$push": {str(port): scanned_date},
             "$addToSet": {"domain": domain}})
retval = bulkop.execute()
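The initialize_*_bulk_op builders were deprecated in PyMongo 3.x and removed in 4.0; a roughly equivalent sketch with bulk_write (same field names as above) would be:

from pymongo import UpdateOne

requests = []
for rs in resultset:
    for data in rs["scanned_date_with_port"]:
        requests.append(UpdateOne(
            {"ip": rs["ip"]},
            {"$push": {str(data["port"]): data["scanned_date"]},
             "$addToSet": {"domain": rs["domain"]}},
            upsert=True,
        ))
if requests:
    retval = self.coll_historical_data.bulk_write(requests)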
I am getting data from one collection using Python, processing it, and storing it in another collection. In the processed collection some of the date fields look different, like Date(-61833715200000).
I use the code below to get the data, process it, and then bulk insert the values into the new collection.
import pandas as pd

fleet_managers = taximongo.users.aggregate([{"$match": {"role": "fleet_manager"}}])
fleet_managers = pd.DataFrame(list(fleet_managers))
fleet_managers['city_id'] = fleet_managers['region_id'].map(
    {'57ff2e84f39e0f0444000004': 'Chennai', '57ff2e08f39e0f0444000003': 'Hyderabad'})
pros_fleet_managers.insert_many(fleet_managers.to_dict('records'))
The collection looks like this:
{
    "_id" : ObjectId("58006678ee5e0e29c5000009"),
    "deleted_at" : NaN,
    "region_id" : "57ff2e84f39e0f0444000004",
    "reset_password_sent_at" : Date(-61833715200000),
    "current_sign_in_at" : ISODate("2016-10-14T06:07:55.568Z"),
    "last_sign_in_at" : ISODate("2016-10-14T06:07:45.574Z"),
    "remember_created_at" : Date(-61833715200000)
}
What did I do wrong here? Thanks in advance.
I found the solution by using $ifNull while projecting the fields.
fleet_managers = taximongo.users.aggregate([
    {"$match": {"role": "fleet_manager"}},
    {"$project": {
        '_id': 1,
        'deleted_at': {"$ifNull": ["$deleted_at", "null"]},
        'reset_password_sent_at': {"$ifNull": ["$reset_password_sent_at", "null"]},
        'region_id': 1,
        'current_sign_in_at': 1,
        'last_sign_in_at': 1,
        'remember_created_at': {"$ifNull": ["$remember_created_at", "null"]},
    }},
])
fleet_managers = pd.DataFrame(list(fleet_managers))
fleet_managers['city_id'] = fleet_managers['region_id'].map(
    {'57ff2e84f39e0f0444000004': 'Chennai', '57ff2e08f39e0f0444000003': 'Hyderabad'})
pros_fleet_managers.insert_many(fleet_managers.to_dict('records'))
The above code works, but I need to handle the null or non-existent fields dynamically, i.e., when fetching them from the source collection.
Help me out on this.
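One way to make that dynamic is to build the $project stage programmatically from a list of date fields. A sketch (the field lists are illustrative):

# Fields that may be missing or null vs. fields passed through as-is
date_fields = ['deleted_at', 'reset_password_sent_at', 'remember_created_at']
plain_fields = ['_id', 'region_id', 'current_sign_in_at', 'last_sign_in_at']

project = {f: 1 for f in plain_fields}
project.update({f: {"$ifNull": ["$" + f, "null"]} for f in date_fields})

fleet_managers = taximongo.users.aggregate([
    {"$match": {"role": "fleet_manager"}},
    {"$project": project},
])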
The function below searches a collection whose documents have a projects subitem. If there is a subitem with isManager set to 1 it should return True, otherwise False.
def isMasterProject(self, pid, uid):
    masterProjects = False
    proj = self.collection.find({"_id": uid, "projects": {'$elemMatch': {"projectId": _byid(pid), "isManager": 1}}})
    for value in proj:
        if str(value['projects']['projectId']) == pid:
            if value['projects']['isManager'] == 1:
                masterProjects = True
    return masterProjects
_byid is equivalent to ObjectId
It always seems to return False. Here's an example document from the collection.
{
    "_id" : ObjectId("52cf683306bcfc7be96a4d89"),
    "firstName" : "Test",
    "lastName" : "User",
    "projects" : [
        {
            "projectId" : ObjectId("514f593c06bcfc1e96f619be"),
            "isManager" : 0
        },
        {
            "projectId" : ObjectId("511e3ed0909706a6a188953d"),
            "isManager" : 1
        },
        {
            "projectId" : ObjectId("51803baf06bcfc149116bf62"),
            "isManager" : 1
        },
        {
            "projectId" : ObjectId("514362bf121f92fb6867e58f"),
            "isManager" : 1
        }
    ],
    "user" : "test.user@example.com",
    "userType" : "Basic"
}
Would it be simpler to check for an empty cursor and if so how would I do that?
How about:
obj = next(proj, None)
if obj:
    # at least one document matched the query
    masterProjects = True
$elemMatch only returns results if the given criteria match a document, so find should only return a cursor with results where your criteria are true.
Since you are using _id in the query and only ever expect one result, why not use findOne and shortcut one step?
Another gotcha for new initiates: be aware that you are getting back the whole document here, not a representation containing only the matching element of the array. The array elements that did not match will still be there, and expecting otherwise while iterating over them will lead you to grief.
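A sketch combining both suggestions in PyMongo (find_one plus a positional projection, so only the matching array element comes back; _byid as defined in the question):

def isMasterProject(self, pid, uid):
    # find_one returns None when nothing matches, so no cursor check is needed
    doc = self.collection.find_one(
        {"_id": uid,
         "projects": {"$elemMatch": {"projectId": _byid(pid), "isManager": 1}}},
        {"projects.$": 1},  # keep only the matching array element
    )
    return doc is not None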