How to prevent duplicates in pymongo

How to prevent duplicates in pymongo - python

I am using pymongo "insert_one",
I want to prevent insertion of two documents with the same "name" attribute.
How do I generally prevent duplicates?
How do I config it for a specific attribute like name?
Thanks!
My code:
client = MongoClient('mongodb://localhost:8888/db')
db = client[<db>]
heights=db.heights
post_id= heights.insert_one({"name":"Tom","height":2}).inserted_id
try:
post_id2 = heights.insert_one({"name":"Tom","height":3}).inserted_id
except pymongo.errors.DuplicateKeyError, e:
print e.error_document
print post_id
print post_id2
output:
56aa7ad84f9dcee972e15fb7
56aa7ad84f9dcee972e15fb8

There is an answer for preventing addition of duplicate documents in mongoDB in general at How to stop insertion of Duplicate documents in a mongodb collection .
The idea is to use update with upsert=True instead of insert_one.
So while inserting the code for pymongo would be
db[collection_name].update(document,document,upsert=True)

You need to create an index that ensures the name is unique in that collection
e.g.
db.heights.create_index([('name', pymongo.ASCENDING)], unique=True)
Please see the official docs for further details and clarifying examples

This is your document
doc = {"key": val}
Then, use $set with your document to update
update = {"$set": doc} # it is important to use $set in your update
db[collection_name].update(document, update, upsert=True)

Related

Pymongo BulkWriteResult doesn't contain upserted_ids

Okey so currently I'm trying to upsert something in a local mongodb using pymongo.(I check to see if the document is in the db and if it is, update it, otherwise just insert it)
I'm using bulk_write to do that, and everything is working ok. The data is inserted/updated.
However, i would need the ids of the newly inserted/updated documents but the "upserted_ids" in the bulkWriteResult object is empty, even if it states that it inserted 14 documents.
I've added this screenshot with the variable. Is it a bug? or is there something i'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write)
Thank you for your time.
EDIT:
As suggested, i added a part of the code so it's easier to get the general ideea:
for name in input_list:
if name not in stored_names: #completely new entry (both name and package)
operations.append(InsertOne({"name": name, "package" : [package_name]}))
if len(operations) == 0:
print ("## No new permissions to insert")
return
bulkWriteResult = _db_insert_bulk(collection_name,operations)
and the insert function:
def _db_insert_bulk(collection_name,operations_list):
return db[collection_name].bulk_write(operations_list)

The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne which doesn't have an upsert option, the upserted_ids list will be empty.
The lack of an inserted_ids field in pymongo's BulkWriteResult in an omission in the drivers; technically it conforms to crud specificaiton mentioned in D. SM's answer as it is annotated as "Drivers may choose to not provide this property.".
But ... there is an answer. If you are only doing inserts as part of your bulk update (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]

This functionality is part of crud specification and should be implemented by compliant drivers including pymongo. Reference pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.

insert new field in mongodb database

I'm a beginner in mongodb and pymongo and I'm working on a project where I have a students mongodb collection . What I want is to add a new field and specifically an adrress of a student to each element in my collection (the field is obviously added everywhere as null and will be filled by me later).
However when I try using this specific example to add a new field I get a the following syntax error:
client = MongoClient('mongodb://localhost:27017/') #connect to local mongodb
db = client['InfoSys'] #choose infosys database
students = db['Students']
students.update( { $set : {"address":1} } ) #set address field to every column (error happens here)
How can I fix this error?

You are using the update operation in wrong manner. Update operation is having the following syntax:
db.collection.update(
<query>,
<update>,
<options>
)
The main parameter <query> is not at all mentioned. It has to be at least empty like {}, In your case the following query will work:
db.students.update(
{}, // To update the all the documents.
{$set : {"address": 1}}, // Update the address field.
{multi: true} // To do multiple updates, otherwise Mongo will just update the first matching document.
)
So, in python, you can use update_many to achieve this. So, it will be like:
students.update_many(
{},
{"$set" : {"address": 1}}
)
You can read more about this operation here.

The previous answer here is spot on, but it looks like your question may relate more to PyMongo and how it manages updates to collections. https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html
According to the docs, it looks like you may want to use the 'update_many()' function. You will still need to make your query (all documents, in this case) as the first argument, and the second argument is the operation to perform on all records.
client = MongoClient('mongodb://localhost:27017/') #connect to local mongodb
db = client['InfoSys'] #choose infosys database
students = db['Students']
sudents.update_many({}, {$set : {"address":1}})

I solved my problem by iterating through every element in my collection and inserting the address field to each one.
cursor = students.find({})
for student in cursor :
students.update_one(student, {'$set': {'address': '1'}})

Python-Eve: Prevent inserting duplicates without using unique fields

I am trying to prevent inserting duplicate documents by the following approach:
Get a list of all documents from the desired endpoint which will contain all the documents in JSON-format. This list is called available_docs.
Use a pre_POST_<endpoint> hook in order to handle the request before inserting to the data. I am not using the on_insert hook since I need to do this before validation.
Since we can access the request object use request.json to get the payload JSON-formatted
Check if request.json is already contained in available_docs
Insert new document if it's not a duplicate only, abort otherwise.
Using this approach I got the following snippet:
def check_duplicate(request):
if not request.json in available_sims:
print('Not a duplicate')
else:
print('Duplicate')
flask.abort(422, description='Document is a duplicate and already in database.')
The available_docs list looks like this:
available_docs = [{'foo': ObjectId('565e12c58b724d7884cd02bb'), 'bar': [ObjectId('565e12c58b724d7884cd02b9'), ObjectId('565e12c58b724d7884cd02ba')]}]
The payload request.json looks like this:
{'foo': '565e12c58b724d7884cd02bb', 'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}
As you can see, the only difference between the document which was passed to the API and the document already stored in the DB is the datatype of the IDs. Due to that fact, the if-statement in my above snippet evaluates to True and judges the document to be inserted not being a duplicate whereas it definitely is a duplicate.
Is there a way to check if a passed document is already in the database? I am not able to use unique fields since the combination of all document fields needs to be unique only. There is an unique identifier (which I left out in this example), but this is not suitable for the desired comparison since it is kind of a time stamp.
I think something like casting the given IDs at the keys foo and bar as ObjectIDs would do the trick, but I do not know how to to this since I do not know where to get the datatype ObjectID from.

You approach would be much slower than setting a unique rule for the field.
Since, from your example, you are going to compare objectids, can't you simply use those as the _id field for the collection? In Mongo (and Eve of course) that field is unique by default. Actually, you typically don't even define it. You would not need to do anything at all, as a POST of a document with an already existing id would fail right away.
If you can't go that way (maybe you need to compare a different objectid field and still, for some reason, you can't simply set a unique rule for the field), I would look at querying the db for the field value instead than getting all the documents from the db and then scanning them sequentially in code. Something like db.find({db_field: new_document_field_value}). If that returns true, new document is a duplicate. Make sure db_field is indexed (which usually holds true also for fields tagged with unique rule)
EDIT after the comments. A trivial implementation would probable be something like this:
def pre_POST_callback(resource, request):
# retrieve mongodb collection using eve connection
docs = app.data.driver.db['docs']
if docs.find_one({'foo': <value>}):
flask.abort(422, description='Document is a duplicate and already in database.')
app = Eve()
app.run()

Here's my approach on preventing duplicate records:
def on_insert_subscription(items):
c_subscription = app.data.driver.db['subscription']
user = decode_token()
if user:
for item in items:
if c_subscription.find_one({
'topic': ObjectId(item['topic']),
'client': ObjectId(user['user_id'])
}):
abort(422, description="Client already subscribed to this topic")
else:
item['client'] = ObjectId(user['user_id'])
else:
abort(401, description='Please provide proper credentials')
What I'm doing here is creating subscriptions for clients. If a client is already subscribed to a topic I throw 422.
Note: the client ID is decoded from the JWT token.

To find the _id of document other than during insertion

I have created several documents and inserted into Mongo DB.I am using python for the same . Is there a way where in I can get the _id of a particular record?
I know that we get the _id during insertion. But if i need to use it at a lateral interval is there a way I can get it by say using the find() command?

You can user the projection to get particular field from the document like this
db.collection.find({query},{_id:1}) this will return only _id
http://docs.mongodb.org/manual/reference/method/db.collection.find/

In python, you can use the find_one() method to get the document and access the _id property as following:
def get_id(value):
# Get the document with the record:
document = client.db.collection.find_one({'field': value}, {'_id': True})
return document['_id']

When you insert the record, you can specify the _id value. This has added benefits for when you're using ReplicaSets as well.
from pymongo.objectid import ObjectId
client.db.collection.insert({'_id': ObjectId(), 'key1': value....})
You could store those Ids in a list and use it later on if your requirement for needing the _id occurs immediately after insert.

How to autoincrement id at every insert in Pymongo?

I am using pymongo to insert documents in the mongodb.
here is code for router.py file
temp = db.admin_collection.find().sort( [("_id", -1)] ).limit(1)
for doc in temp:
admin_id = str(int(doc['_id']) + 1)
admin_doc ={
'_id' : admin_id,
'question' : ques,
'answer' : ans,
}
collection.insert(admin_doc)
what should i do so that at every insert _id is incremented by 1.

It doesn’t seem like a very good idea, but if you really want to go through with it you can try setup like below.
It should work good enough in a low traffic application with single server, but I wouldn't try anything like this with replicated or sharded enviroment or if you perform large amount of inserts.
Create separate collection to handle id seqs:
db.seqs.insert({
'collection' : 'admin_collection',
'id' : 0
})
Whenever you need to insert new document use something similar to this:
def insert_doc(doc):
doc['_id'] = str(db.seqs.find_and_modify(
query={ 'collection' : 'admin_collection' },
update={'$inc': {'id': 1}},
fields={'id': 1, '_id': 0},
new=True
).get('id'))
try:
db.admin_collection.insert(doc)
except pymongo.errors.DuplicateKeyError as e:
insert_doc(doc)

If you want to manually change the "_id" value you can do this by changing the _id value in the returned document. You could do this in a similar manner to the way you have proposed in your question. However I do not think this approach is advisable.
curs = db.admin_collection.find().sort( [("_id", -1)] ).limit(1)
for document in curs:
document['_id'] = str(int(document['_id']) + 1)
collection.insert(document)
It is generally not a good idea to manually make your own id's. These values have to be unique and there is no guarantee that str(int(document['_id']) + 1) will always be unique.
Instead if you want to duplicate the document you can delete the '_id' key and insert the document.
curs = db.admin_collection.find().sort( [("_id", -1)] ).limit(1)
for document in curs:
document.pop('_id',None)
collection.insert(document)
This inserts the document and allows mongo to generate the unique id.

Way late to this, but what about leaving the ObjectId alone and still adding a sequential id to use as a reference for calling the particular document(or props thereof) from a frontend api? I've been struggling to get the frontend to drop the "" on the ObjectId when fetching from the api.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.