I want to remove duplicate data from my collection in MongoDB. How can I accomplish this?
Please refer to this example to understand my problem:
The documents in my collection look like this:
{
"questionText" : "what is android ?",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1234ffa7085"),
"userId" : "102"
},
{
"questionText" : "what is android ?",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1054ffa7086"),
"userId" : "102"
}
How do I remove the duplicate question by the same userId? Any help?
I'm using Python and MongoDB.
IMPORTANT: The dropDups option was removed starting with MongoDB 3.x, so this solution is only valid for MongoDB versions 2.x and before. There is no direct replacement for the dropDups option. The answers to the question posed at http://stackoverflow.com/questions/30187688/mongo-3-duplicates-on-unique-index-dropdups offer some possible alternative ways to remove duplicates in Mongo 3.x.
Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option.
Assuming the collection includes a field named record_id that uniquely identifies a record in the collection, the command to use to create a unique index and drop duplicates is:
db.collection.ensureIndex( { record_id:1 }, { unique:true, dropDups:true } )
Here is the trace of a session that shows the contents of a collection before and after creating a unique index with dropDups. Notice that duplicate records are no longer present after the index is created.
> db.pages.find()
{ "_id" : ObjectId("52829c886602e2c8428d1d8c"), "leaf_num" : "1", "scan_id" : "smithsoniancont251985smit", "height" : 3464, "width" : 2548 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8d"), "leaf_num" : "1", "scan_id" : "smithsoniancont251985smit", "height" : 3464, "width" : 2548 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8e"), "leaf_num" : "2", "scan_id" : "smithsoniancont251985smit", "height" : 3587, "width" : 2503 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8f"), "leaf_num" : "2", "scan_id" : "smithsoniancont251985smit", "height" : 3587, "width" : 2503 }
>
> db.pages.ensureIndex( { scan_id:1, leaf_num:1 }, { unique:true, dropDups:true } )
>
> db.pages.find()
{ "_id" : ObjectId("52829c886602e2c8428d1d8c"), "leaf_num" : "1", "scan_id" : "smithsoniancont251985smit", "height" : 3464, "width" : 2548 }
{ "_id" : ObjectId("52829c886602e2c8428d1d8e"), "leaf_num" : "2", "scan_id" : "smithsoniancont251985smit", "height" : 3587, "width" : 2503 }
>
Since the dropDups option is no longer available, you can use pandas instead:
1. Select the fields you need from MongoDB.
2. Use pandas.DataFrame.duplicated to mark every duplicate as True except the first occurrence.
3. Remove the documents marked as duplicates from the collection using their _ids.
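The steps above can be sketched in Python. This version swaps pandas for a plain set-based pass so that it runs without extra dependencies. The helper name `find_duplicate_ids`, the choice of `userId` plus `questionText` as the duplicate key, and the third sample document are assumptions for illustration. The deletion with pymongo is shown only as a comment, since it needs a live database.

```python
def find_duplicate_ids(docs, key_fields):
    """Return the _id of every document whose key was already seen.

    Mirrors pandas.DataFrame.duplicated(keep="first"): the first
    document with a given key is kept, later ones are reported.
    """
    seen = set()
    duplicate_ids = []
    for doc in docs:
        key = tuple(doc.get(field) for field in key_fields)
        if key in seen:
            duplicate_ids.append(doc["_id"])
        else:
            seen.add(key)
    return duplicate_ids


# Sample documents shaped like the ones in the question.
# The _id values are plain strings here instead of ObjectId,
# and the third document is made up for illustration.
docs = [
    {"_id": "540f346c3e7fc1234ffa7085", "userId": "102", "questionText": "what is android ?"},
    {"_id": "540f346c3e7fc1054ffa7086", "userId": "102", "questionText": "what is android ?"},
    {"_id": "540f346c3e7fc1054ffa7087", "userId": "103", "questionText": "what is android ?"},
]

duplicate_ids = find_duplicate_ids(docs, ["userId", "questionText"])
print(duplicate_ids)

# With a live pymongo collection (hypothetical handle `collection`), the
# documents would come from
#     collection.find({}, {"userId": 1, "questionText": 1})
# and the duplicates would be removed with
#     collection.delete_many({"_id": {"$in": duplicate_ids}})
```

Only the second document is reported, because the third one belongs to a different userId.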
I must be really slow because I spent a whole day googling and trying to write Python code to simply list the "code" values only so my output will be Service1, Service2, Service2. I have extracted json values before from complex json or dict structure. But now I must have hit a mental block.
This is my json structure.
import json

myjson='''
{
"formatVersion" : "ABC",
"publicationDate" : "2017-10-06",
"offers" : {
"Service1" : {
"code" : "Service1",
"version" : "1a1a1a1a",
"index" : "1c1c1c1c1c1c1"
},
"Service2" : {
"code" : "Service2",
"version" : "2a2a2a2a2",
"index" : "2c2c2c2c2c2"
},
"Service3" : {
"code" : "Service4",
"version" : "3a3a3a3a3a",
"index" : "3c3c3c3c3c3"
}
}
}
'''
#convert above string to json
somejson = json.loads(myjson)
print(somejson["offers"]) # I tried so many variations to no avail.
Or, if you want the "code" values:
>>> [s['code'] for s in somejson['offers'].values()]
['Service1', 'Service2', 'Service4']
somejson["offers"] is a dictionary. It seems you want to print its keys.
In Python 2:
print(somejson["offers"].keys())
In Python 3:
print(list(somejson["offers"].keys()))
In Python 3, keys() returns a view rather than a list, so wrap it in list() (or use a list comprehension) if you need an actual list.
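To make the difference between the keys of `offers` and the nested "code" values concrete, here is a small sketch. It uses a shortened copy of the JSON from the question (the `version` and `index` fields are left out), and the variable names `service_names` and `codes` are assumptions.

```python
import json

# Shortened copy of the question's JSON, keeping only the "code" fields.
myjson = '''
{
  "offers": {
    "Service1": {"code": "Service1"},
    "Service2": {"code": "Service2"},
    "Service3": {"code": "Service4"}
  }
}
'''

somejson = json.loads(myjson)
offers = somejson["offers"]

# keys() returns a view in Python 3; list() turns it into a list.
service_names = list(offers.keys())

# The "code" values live one level deeper, inside each service dict.
codes = [service["code"] for service in offers.values()]

print(service_names)
print(codes)
```

The two lists differ in the last entry because the key "Service3" holds the code "Service4" in the sample data.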
This should do the trick if you are not sure how many services the JSON contains.
import json
myjson='''
{
"formatVersion" : "ABC",
"publicationDate" : "2017-10-06",
"offers" : {
"Service1" : {
"code" : "Service1",
"version" : "1a1a1a1a",
"index" : "1c1c1c1c1c1c1"
},
"Service2" : {
"code" : "Service2",
"version" : "2a2a2a2a2",
"index" : "2c2c2c2c2c2"
},
"Service3" : {
"code" : "Service4",
"version" : "3a3a3a3a3a",
"index" : "3c3c3c3c3c3"
}
}
}
'''
#convert above string to json
somejson = json.loads(myjson)
#Without knowing the Services:
offers = somejson["offers"]
keys = offers.keys()
for service in keys:
    print(somejson["offers"][service]["code"])
I have two collections in a MongoDB database. Each document in Collection2 stores the _id of a related document in Collection1. I want to copy some values from Collection1 into a nested field (dataFromCollection1) of the related documents in Collection2. I'm asking for help because I can't work out how to read values from MongoDB fields into Python variables.
Collection1:
{
"_id" : ObjectId("583d498214f89c3f08b10e2d"),
"name" : "Name",
"gender" : "men",
"secondName" : "",
"testData" : [ ],
"numberOf" : NumberInt(0),
"place" : "",
"surname" : "Surname",
"field1" : "eggs",
"field2" : "hamm",
"field3" : "foo",
"field4" : "bar"
}
Collection2:
{
"_id" : ObjectId("58b028e26900ed21d5153a36"),
"collection1" : ObjectId("583d498214f89c3f08b10e2d"),
"fieldCol2_1" : "123",
"fieldCol2_2" : "332",
"fieldCol2_3" : "133",
"dataFromCollection1" : {
"name" : " ",
"surname" : " ",
"field1" : " ",
"field2" : " ",
"field3" : " ",
"field4" : " "
}
}
I think you should use the aggregate function in the pymongo package. In the pipeline you can use $lookup to join the matching documents and $project to select the required fields. This has been asked before, so the question "pymongo - how to match on lookup?" might help you.
Then you can use the $out stage to write the result to a new collection, or you can update the existing collection with pymongo's update methods.
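As an alternative to the aggregation route, the copy can also be done from Python one document at a time. The sketch below only builds the update document. The helper name `build_copy_update` and the collection handles `collection1` and `collection2` are assumptions, and the pymongo calls appear as comments because they need a running database.

```python
FIELDS_TO_COPY = ["name", "surname", "field1", "field2", "field3", "field4"]


def build_copy_update(source_doc, fields=FIELDS_TO_COPY):
    """Build a $set update that copies fields into dataFromCollection1."""
    return {
        "$set": {
            "dataFromCollection1." + field: source_doc[field]
            for field in fields
            if field in source_doc
        }
    }


# A Collection1 document reduced to the fields of interest.
source_doc = {
    "name": "Name",
    "surname": "Surname",
    "field1": "eggs",
    "field2": "hamm",
    "field3": "foo",
    "field4": "bar",
}

update = build_copy_update(source_doc)
print(update)

# With pymongo (collection1 and collection2 are hypothetical handles):
# for target in collection2.find({}, {"collection1": 1}):
#     source = collection1.find_one({"_id": target["collection1"]})
#     if source is not None:
#         collection2.update_one({"_id": target["_id"]}, build_copy_update(source))
```

Using dotted keys inside $set changes only the listed subfields and leaves the rest of each Collection2 document untouched.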
Is there a way to select only the userId field from the embedded _id document?
I tried the query below.
I also want to delete all the documents that this query returns, perhaps in batches of 10,000, so that the load stays low and the operation does not hamper the database. Please suggest an approach.
Sample Data:
"_id" : {
"Path" : 0,
"TriggerName" : "T1",
"userId" : NumberLong(231),
"Date" : "02/09/2017",
"OfferType" : "NOOFFER"
},
"OfferCount" : NumberLong(0),
"OfferName" : "NoOffer",
"trgtm" : NumberLong("1486623660308"),
"trgtype" : "PREDEFINED",
"desktopTop-normal" : NumberLong(1)
query:
mongo --eval 'db.l.find({"_id.Date": {"$lt" : "03/09/2017"}},{"_id.userId":1}).limit(1).forEach(printjson)'
output:
{
"_id" : {
"Path" : 0,
"TriggerName" : "T1",
"userId" : NumberLong(231),
"Date" : "02/09/2017",
"OfferType" : "NOOFFER"
}
}
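For the batching part of the question, one possible approach is to collect the matching _id values and delete them in fixed-size chunks with a pause in between. The sketch below shows only the chunking. The helper name `chunked`, the collection handle `coll`, and the pause length are assumptions, and the pymongo calls are left as comments because they need a live database.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for the _id values returned by the find() in the question.
ids = list(range(25))
batches = list(chunked(ids, 10))
print([len(batch) for batch in batches])

# With pymongo (coll is a hypothetical collection handle):
# import time
# cursor = coll.find({"_id.Date": {"$lt": "03/09/2017"}}, {"_id": 1})
# ids = [doc["_id"] for doc in cursor]
# for batch in chunked(ids, 10000):
#     coll.delete_many({"_id": {"$in": batch}})
#     time.sleep(1)  # pause between batches to limit load
```

Note that the Date values in the sample are strings, so `$lt` compares them character by character rather than as calendar dates.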
I have a document structured like this:
{
"_id" : ObjectId("564c0cb748f9fa2c8cdeb20f"),
"username" : "blah",
"useremail" : "blah#blahblah.com",
"groupTypeCustomer" : true,
"addedpartners" : [
"562f1a629410d3271ba74f74",
"562f1a6f9410d3271ba74f83"
],
"groupName" : "Mojito",
"groupTypeSupplier" : false,
"groupDescription" : "A group for fashion designers"
}
Now I want to delete one of the values from this 'addedpartners' array and update the document.
I want to just delete 562f1a6f9410d3271ba74f83 from the addedpartners array
This is what I had tried earlier.
db.myCollection.update({'_id':'564c0cb748f9fa2c8cdeb20f'},{'$pull':{'addedpartners':'562f1a6f9410d3271ba74f83'}})
db.myCollection.update(
{ _id: ObjectId(id) },
{ $pull: { 'addedpartners': '562f1a629410d3271ba74f74' } }
);
Try this:
db.myCollection.update({}, {$unset : {"addedpartners.1" : 1 }})
db.myCollection.update({}, {$pull : {"addedpartners" : null}})
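There is no way to remove an array element by index directly. I think this will work, though I haven't tried it: $unset replaces the element at index 1 with null, and $pull then removes the null values.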
I want to query MongoDB to find all users who have 'artist' equal to 'Iowa' in any object of an array.
In Python I'm doing:
Vkuser._get_collection().find({
'member_of_group': 20548570,
'my_music': {
'items': {
'$elemMatch': {
'artist': 'Iowa'
}
}
}
})
but this returns nothing. I also tried this:
{'member_of_group': 20548570, 'my_music': {'$elemMatch': {'$.artist': 'Iowa'}}} and that didn't work either.
Here is part of document with array:
"can_see_audio" : 1,
"my_music" : {
"items" : [
{
"name" : "Anastasia Plotnikova",
"photo" : "http://cs620227.vk.me/v620227451/9c47/w_okXehPbYc.jpg",
"id" : "864451",
"name_gen" : "Anastasia"
},
{
"title" : "Ain't Talkin' 'Bout Dub",
"url" : "http://cs4964.vk.me/u14671028/audios/c5b8a0735224.mp3?extra=jgV4ZQrFrsfxZCJf4gsRgnKWvdAfIqjE0M6eMtxGFpj2yp4vjs5DYgAGImPMp4mCUSUGJzoyGeh2Es6L-H51TPa3Q_Q",
"lyrics_id" : 24313846,
"artist" : "Apollo 440",
"genre_id" : 18,
"id" : 344280392,
"owner_id" : 864451,
"duration" : 279
},
{
"title" : "Animals",
"url" : "http://cs1316.vk.me/u4198685/audios/4b9e4536e1be.mp3?extra=TScqXzQ_qaEFKHG8trrwbFyNvjvJKEOLnwOWHJZl_cW5EA6K3a9vimaMpx-Yk5_k41vRPywzuThN_IHT8mbKlPcSigw",
"lyrics_id" : 166037,
"artist" : "Nickelback",
"id" : 344280351,
"owner_id" : 864451,
"duration" : 186
},
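The following query should work. You can use dot notation to query into subdocuments and arrays. Your first attempt compares my_music against an exact embedded document, which is why it matched nothing.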
Vkuser._get_collection().find({
'member_of_group': 20548570,
'my_music.items.artist':'Iowa'
})
The following query worked for me in the mongo shell
db.collection1.find({ "my_music.items.artist" : "Iowa" })
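To see why the dot-notation query matches, here is a plain-Python sketch of the check it performs on the items array: a document matches when at least one array element has artist equal to the requested value, and elements without an artist field are skipped. The function name `has_artist` is an assumption and the second document is made up for illustration. This models the matching rule only, not how the server implements it.

```python
def has_artist(doc, artist):
    """Model of the filter {"my_music.items.artist": artist}."""
    items = doc.get("my_music", {}).get("items", [])
    return any(
        isinstance(item, dict) and item.get("artist") == artist
        for item in items
    )


docs = [
    # Shortened version of the document from the question.
    {"my_music": {"items": [
        {"name": "Anastasia Plotnikova", "id": "864451"},
        {"title": "Animals", "artist": "Nickelback"},
    ]}},
    # Hypothetical document that contains the artist being searched for.
    {"my_music": {"items": [
        {"title": "hypothetical track", "artist": "Iowa"},
    ]}},
]

matches = [has_artist(doc, "Iowa") for doc in docs]
print(matches)

# The same condition can also be written with $elemMatch on the array itself:
# {"my_music.items": {"$elemMatch": {"artist": "Iowa"}}}
```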